1. I am a scientist and statistician at deCODE genetics, ehf. ("deCODE"), 
having worked at deCODE for 9 years. My curriculum vitae is attached hereto. 

I. INTRODUCTION 

2. I have been Group Leader for the Cardiovascular Disease Group in the 
Statistical Department at deCODE since 2001. I was involved in the research 
team at deCODE that identified at-risk variants for myocardial infarction (MI) within 
the gene encoding 5-lipoxygenase activating protein (FLAP). I am providing this 
declaration to make available to the United States Patent and Trademark Office (PTO) 
a further explanation of the data supporting the association of variants within the 
FLAP gene with increased risk of myocardial infarction. 

II. OBSERVATION AND ANALYSIS REGARDING THE RELIABILITY 
OF GENETIC ASSOCIATION STUDIES 

3. I have read the PTO's examination reports dated July 21, 2006; April 
16, 2007; and January 4, 2008, prepared by Examiner Goldberg. The information and 
opinions that I provide in this section pertain to the PTO's positions in the 
examination reports about the unpredictability and alleged irreproducibility of genetic 
association studies. 

4. Historically, the field of human genetics has been plagued by claims of 
association findings that, on scrutiny, have not held up in replication studies. There 
are many reasons for this lack of replication, including poor experimental design, 
population stratification and lack of power in the initial discovery and/or replication 
studies. However, important advances have been made in the last few years, and in 
particular in the last couple of years. With the discovery and validation of very large 
numbers of Single Nucleotide Polymorphisms (SNPs), a resource for genome-wide 
searches for SNPs associated with human disease has been generated (see 
http://www.hapmap.org). Technological advances have also made highly accurate 
genotyping of up to one million SNPs on a single chip possible. Finally, experimental 
design in genetic association studies has been drastically improved, and there is a 
general consensus requirement in the scientific community that all initial findings 
should be replicated (i.e., confirmed) prior to publication. 

5. I would like to clarify the concept of "statistical power" in genetic 
association studies, as this is important to understand why some studies are less likely 
than others to replicate an observed genetic association, even though the underlying 
association is real and applies to all populations. Power is simply the probability of 
finding a statistically significant difference between cases and controls in a particular 
study, given a real underlying difference between cases and controls. The power of a 
genetic association study depends on several factors, most important being the size of 
the study populations (e.g., cohort size; numbers of cases and controls), but also the 
effect size or the odds ratio (OR) of the genetic variant and the frequency of the 
variants. "Odds ratio" is a measure of effect size; it can be defined as the ratio of the 
odds of an event occurring in one group to the odds of it occurring in another group. 
An odds ratio of 1 implies that the event is equally likely in both groups. When an 
odds ratio is greater than 1 when comparing cases to controls, it implies that a 
particular variant is more likely to be found in the case group than in the controls. 
The power of an association study increases both with increased cohort size and larger 
OR of the variant. Hence, for variants with a small effect, much larger study groups 
are needed to detect (with statistical significance) the association than for a variant 
with a large effect. As an example, given the estimated effect of HapA based on the 
replication groups, OR = 1.11, and assuming that the population frequency of HapA is 
14%, a study group of 1,000 cases and 1,000 controls has only about a 21% chance of 
detecting a significant difference in the frequency of the HapA haplotype between MI 
cases and controls (for the power calculation method, see, e.g., Rice JA, Mathematical 
Statistics and Data Analysis, 2nd ed., Duxbury Press, Belmont, California (1995)). If 
the study group is increased to 5,000 cases and 5,000 controls, the power is increased 
to a 74% chance of detecting significant association to HapA. 
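
The power figures in the preceding paragraph can be approximated with a short calculation. The sketch below is illustrative only and rests on assumptions that the declaration does not state: it treats the 14% population frequency as the frequency in controls, counts two chromosomes per individual, and uses a two-sided two-proportion z-test at a significance level of 0.05. Under those assumptions the results come out close to the 21% and 74% quoted above. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def case_frequency(control_freq, odds_ratio):
    """Haplotype frequency in cases implied by the control frequency and the OR."""
    odds = odds_ratio * control_freq / (1.0 - control_freq)
    return odds / (1.0 + odds)


def allelic_power(n_cases, n_controls, control_freq, odds_ratio, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test on haplotype counts."""
    std_normal = NormalDist()
    p_cases = case_frequency(control_freq, odds_ratio)
    # Assumption: each individual contributes two chromosomes to the comparison.
    chrom_cases, chrom_controls = 2 * n_cases, 2 * n_controls
    se = sqrt(p_cases * (1 - p_cases) / chrom_cases
              + control_freq * (1 - control_freq) / chrom_controls)
    shift = abs(p_cases - control_freq) / se
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Probability that the test statistic falls in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


power_1000 = allelic_power(1000, 1000, control_freq=0.14, odds_ratio=1.11)
power_5000 = allelic_power(5000, 5000, control_freq=0.14, odds_ratio=1.11)
print(f"1,000 cases / 1,000 controls: power ~ {power_1000:.2f}")
print(f"5,000 cases / 5,000 controls: power ~ {power_5000:.2f}")
```

A different test (for example, a one-sided test or a genotype-based model) would give somewhat different numbers, so the sketch should be read as a consistency check and not as the exact method used.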

6. A large number of studies meeting established criteria of statistical 
significance have been published in the last 2 years, many of which are multicenter 
Genome-Wide Association (GWA) studies (reviewed in Pearson TA & Manolio, TA, 
JAMA 299:1335-1344 (2008); Bowcock, AM, Nature 447:645-46 (2007); Altshuler, 
D & Daly, M., Nature Genet 39:813-815 (2007); Kruglyak, L, Nature Reviews Genet 
9:314-18 (2008); Kingsmore, S.F., et al., Nature Reviews Genet 7:221-230 (2008); 
Frayling, TM, Nature Reviews Genet 8:657-662 (2007)). Through GWA studies, 
nearly 100 loci for up to 40 common diseases have been identified and confirmed 
(Pearson TA & Manolio, TA, JAMA 299:1335-1344 (2008)). 

7. A common theme for most of the risk-associated genetic variants thus 
identified is that the risk conferred by each variant is typically relatively small, e.g., 
usually a relative risk in the range of about 1.1-1.5 (Bowcock, p. 646; Pearson, p. 
1341; Altshuler, p. 814; Frayling, p. 659). For some diseases, a large number of 
variants have been identified. A good example is Type 2 diabetes, which was 
previously referred to as the "geneticist's nightmare" due to the lack of identification 
of genetic variants for the disease. As summarized in Frayling and in Zeggini et al. 
(Nature Genetics Mar 30 [Epub ahead of print] (2008)), a total of 17 genomic regions 
that alter the risk of Type 2 diabetes have been identified and solidly validated. This 
has been possible by combining results from multiple centers individually 
investigating the genetics of Type 2 diabetes in a joint meta-analysis of available data 
(see Zeggini et al). What is even more remarkable is that with the exception of the 
common variant in the TCF7L2 gene originally identified by scientists at deCODE 
genetics, the risk for these genetic variants is quite small, ranging from 1.09 to 1.20 
per copy of the variant (the risk conferred by the TCF7L2 variant is about 1.35 per 
copy). 

8. The plethora of variants for the common diseases identified in the last 
couple of years has thus established that genetic risk for most common diseases 
appears to be modulated through multiple low-risk variants which, in any given 
individual, act in a concerted manner, together with environmental risk factors, to 
establish overall predisposition for the particular disease. Thus, scientists in the field 
are accepting the validity of genetic association studies that demonstrate low, but 
statistically significant risk in replication studies. 

9. The Patent Office has cited an article by Ioannidis that allegedly states, 
"As a general rule of thumb we are looking for a relative risk of three or more [before 
accepting a paper for publication], particularly if it is biologically implausible or if it's 
a brand new finding." (See, e.g., Office action dated January 4, 2008, at p. 11.) The 
papers that I cite above - from well-respected journals - demonstrate that relative risk 
of three or more is NOT an accepted threshold criterion in the field for publication. 
Nor does it reflect the current thinking or data involving the effects of genetic 
variation and common diseases. 

III. THE RELATIVE RISK SCORES IN THE PRESENT APPLICATION 
ARE CONSISTENT WITH RISKS REPORTED IN THE LITERATURE 
AND GENERALLY ACCEPTED AS "REAL" RISK SCORES IN THE 
FIELD. 

10. The method used by Dr. Manolescu in his meta-analysis of available 
data, and by Dr. Helgadottir in her meta-analysis, represents standard methodology 
commonly used to combine results from multiple genetic association studies. The 
Mantel-Haenszel model that they used (Mantel and Haenszel, 1959; Woodward, 
2005) is designed to deal with the situation where association results from different 
populations, with possible different population frequency of the genetic variant, are 
combined. In that case, the model combines the results assuming that the effect of the 
variant on the risk of the disease, as measured by the OR, is the same in all 
populations, while the frequency of the variant may differ between the populations. 
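
As an illustration of the Mantel-Haenszel approach described in paragraph 10, the following sketch computes the pooled odds ratio from a set of 2x2 tables, one per population. The counts are hypothetical and chosen only to show that the pooled estimate recovers a common OR even when the variant frequency differs between populations; they are not data from the declarations, and the function name is an assumption.

```python
def mantel_haenszel_or(tables):
    """Pooled odds ratio across strata (study populations).

    Each table is (a, b, c, d):
      a = case chromosomes carrying the variant, b = case chromosomes without it,
      c = control chromosomes carrying the variant, d = control chromosomes without it.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator


# Hypothetical counts: the variant is rarer in the first population than in
# the second, but the odds ratio is 2.0 in both.
hypothetical_tables = [
    (20, 80, 10, 80),
    (50, 50, 25, 50),
]
pooled_or = mantel_haenszel_or(hypothetical_tables)
print(f"Mantel-Haenszel pooled OR: {pooled_or:.2f}")
```

Because each stratum is weighted by its own size, populations with different variant frequencies are never mixed into a single table, which is the situation the model is designed for.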

11. I am familiar with the contents of the declarations provided by Anna 
Helgadottir (executed 22 January 2007) and Andrei Manolescu (executed 16 October 
2007) previously officially filed with the PTO, as well as the Table informally 
provided for interview purposes, before the Manolescu declaration was completed. 

12. The Tables provided in the Declarations (and interview) describe 
results of "meta-analysis" of association data for a four-marker FLAP haplotype 
(called "HapA") with MI. Meta-analyses are commonly used as a quantitative way of 
combining the results of several genetic association studies, so as to provide an 
overall estimate of the underlying genetic effect on phenotype. Such analyses are 
especially valuable to achieve acceptable statistical power when genetic effects on 
phenotype are small (e.g., in the range of increased relative risk of 1.1 or 1.2), because 
individual studies may be insufficiently powered (due to limited sample size) to detect 
the true genetic effect. Even in the absence of other factors such as poor study design, 
population stratification and phenotype heterogeneity, the results of individual 
association studies are expected to fluctuate due to inherent fluctuations in sampling: 
If one examines an ensemble of studies, such fluctuations due to sampling are larger 
with respect to smaller and less powered study groups. One way for designers of 
studies to address the statistical reality of such sampling fluctuations is to use 
sufficiently large cohorts for the association study. Alternatively, it is possible to 
achieve a large effective sample size by combining results from several small studies. 
However, if the study cohorts that are combined come from different populations, 
then the population frequencies of the genetic variant tested may differ between the 
studies, and appropriate (statistically accepted) methods commonly used in meta- 
analysis must be applied when combining results from different studies. We at 
deCODE used such statistical tools to perform the meta-analysis presented in the 
declarations of Dr. Helgadottir and Dr. Manolescu, as I explained above in the 
preceding section of this declaration. It is important to note that, while individual 
studies may not yield significant results, the combined results in such meta-analyses 
can be very significant if there is consistency in the direction of the observed effect 
across the studies included in the analysis. 
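
The last point of paragraph 12 can be illustrated numerically. The sketch below uses a simple fixed-effect, inverse-variance pooling of log odds ratios, which is a different and simpler technique than the Mantel-Haenszel model used in the declarations, and the four study estimates are hypothetical. It shows only the general principle: studies that are individually non-significant but consistent in direction can give a clearly significant combined result.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_p(log_or, se):
    """One-sided P-value for the hypothesis of increased risk (log OR > 0)."""
    return 1.0 - NormalDist().cdf(log_or / se)


def pool_fixed_effect(estimates):
    """Inverse-variance weighted average of (log_or, se) pairs."""
    weights = [1.0 / se ** 2 for _, se in estimates]
    pooled = sum(w * log_or for w, (log_or, _) in zip(weights, estimates)) / sum(weights)
    return pooled, sqrt(1.0 / sum(weights))


# Hypothetical studies: same direction and size of effect, each underpowered.
studies = [(0.15, 0.10)] * 4
single_p = one_sided_p(*studies[0])
pooled_log_or, pooled_se = pool_fixed_effect(studies)
pooled_p = one_sided_p(pooled_log_or, pooled_se)
print(f"single study P = {single_p:.3f}, pooled P = {pooled_p:.4f}")
```

Pooling four equally sized studies halves the standard error, so the same estimated effect yields a z statistic twice as large.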

IV. EXPLANATION OF DIFFERENCES IN THE DATA SETS AND 
STATISTICAL ANALYSES IN THE HELGADOTTIR AND 
MANOLESCU DECLARATIONS. 

13. The Examiner notes several differences in the Tables provided in the 
declarations of Dr. Helgadottir, Dr. Manolescu, and in the interview Table provided 
July 31, 2007. The first thing to note is that the interview Table from July 31, 2007, 
reports two-sided P-values, while the two Declarations report one-sided P-values. 
Correcting for this, the interview Table agrees with the Table provided in the 
declaration of Dr. Manolescu. The choice of two-sided P-values in the interview 
Table is unnecessarily conservative, since the analysis is testing a specific effect with 
known direction (i.e., increased risk). In such circumstances, it is statistically 
appropriate to report a one-sided P-value. 
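
The relationship between the two kinds of P-value in paragraph 13 can be made concrete. The sketch below assumes a normally distributed test statistic and a hypothetical value of z = 2.0; when the observed effect lies in the pre-specified direction, the one-sided P-value is exactly half the two-sided one.

```python
from statistics import NormalDist


def p_values(z):
    """Return (one_sided, two_sided) P-values for a z statistic.

    The one-sided value tests the pre-specified direction z > 0 (increased risk).
    """
    upper_tail = 1.0 - NormalDist().cdf(z)
    two_sided = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return upper_tail, two_sided


one_sided, two_sided = p_values(2.0)  # hypothetical z statistic
print(f"one-sided P = {one_sided:.4f}, two-sided P = {two_sided:.4f}")
```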

14. The differences between the analysis presented in the declarations of 
Dr. Helgadottir and Dr. Manolescu, that involve the results for the case/control groups 
from Philadelphia, Cleveland, Atlanta and Durham, arise primarily due to the fact that 
the analyses presented in the two declarations were performed by deCODE at 
different time points, using in each case the most recent phenotype information and 
genotype data available at that time. 

15. More specifically, for the Philadelphia cohort the number of controls, 
and the calculated P-values differed in the declarations because, subsequent to the 
analysis of Dr. Helgadottir, updated phenotype information revealed that three of the 
controls had coronary artery disease. Hence, these three controls in the dataset used 
by Helgadottir were excluded from the analysis of Dr. Manolescu, as individuals with 
known coronary artery and related diseases are excluded from all the control groups. 
Although this change had a very small effect on the estimated frequency of the 
haplotype in controls, the change led to a small change in the P-value presented in the 
analysis of Dr. Helgadottir and Dr. Manolescu, respectively. 

16. For the Cleveland cohort, the differences between the two analyses 
also are due to updated phenotype information. In the updated information, five 
additional individuals were reported with MI and hence included in the "cases" group 
in the analysis of Dr. Manolescu. Three of these individuals with MI had been 
included as controls in the analysis presented in the declaration of Dr. Helgadottir. 
Additional phenotype information also showed that several of the controls used in the 
analysis presented in the declaration of Dr. Helgadottir had been diagnosed with 
either abdominal aortic aneurysm or peripheral artery disease, both of which are 
considered atherosclerotic diseases. Hence, those individuals were excluded from the 
analysis of Dr. Manolescu. 

17. For the Atlanta cohort, the differences between the two declarations 
result from both updated phenotype information and additional individuals genotyped 
for the HapA variants. The additional genotypes explain why there are 762 cases 
used in the analysis of Dr. Manolescu compared to the 713 included in the analysis of 
Dr. Helgadottir. Updated phenotype information on the history of cardiovascular 
diseases for individuals used as controls in the analysis of Dr. Helgadottir led to the 
exclusion of some individuals from the control group, as these individuals reported 
either a history of coronary artery disease or related diseases. 

18. For the Durham cohort, additional genotyping of the HapA variants led 
to the inclusion of an additional 484 cases and 250 controls in the analysis of Dr. 
Manolescu, compared to the analysis of Dr. Helgadottir. 

19. Thus, the differences between the analyses presented in the two 
declarations arise only because each was prepared using the most up-to-date and accurate 
data available to deCODE at the time when each analysis was done. Moreover, while 
these differences in datasets did result in different numerical calculations, the 
conclusion about the genetic correlation to be drawn in each instance was the same 
and statistically significant. 

V. CERTIFICATION 

20. I further declare that all statements made herein of my own knowledge 
are true, that all statements made on information and belief are believed to be true, 
and that these statements were made with the knowledge that willful false statements 
and the like so made are punishable by fine or imprisonment, or both (18 U.S.C. § 
1001), and may jeopardize the validity of the application or any patent issuing 
thereon. 



Dated: [illegible] 
Genome-wide association (GWA) studies use high-throughput genotyping
technologies to assay hundreds of thousands of single-nucleotide
polymorphisms (SNPs) and relate them to clinical conditions and measurable
traits. Since 2005, nearly 100 loci for as many as 40 common diseases and
traits have been identified and replicated in GWA studies, many in genes not
previously suspected of having a role in the disease under study, and some in
genomic regions containing no known genes. GWA studies are an important
advance in discovering genetic variants influencing disease but also have
important limitations, including their potential for false-positive and
false-negative results and for biases related to selection of study
participants and genotyping errors. Although these studies are clearly many
steps removed from actual clinical use, and specific applications of GWA
findings in prevention and treatment are actively being pursued, at present
these studies mainly represent a valuable discovery tool for examining
genomic function and clarifying pathophysiologic mechanisms. This article
describes the design, interpretation, application, and limitations of GWA
studies for clinicians and scientists for whom this evolving science may have
great relevance.

JAMA. 2008;299(11):1335-1344 www.jama.com

In the past 2 years, there has been a dramatic increase in genomic
discoveries involving complex, non-Mendelian diseases, with nearly 100 loci
for as many as 40 common diseases robustly identified and replicated in
genome-wide association (GWA) studies (T.A.M.; unpublished data, 2008). These
studies use high-throughput genotyping technologies to assay hundreds of
thousands of the most common form of genetic variant, the single-nucleotide
polymorphism (SNP), and relate these variants to diseases or health-related
traits. 1 Nearly 12 million unique human SNPs have been assigned a reference
SNP (rs) number in the National Center for Biotechnology Information's dbSNP
database 2 and characterized as to specific alleles (alternate forms of the
SNP), summary allele frequencies, and other genomic information. 3

The GWA approach is revolutionary because it permits interrogation of the
entire human genome at levels of resolution previously unattainable, in
thousands of unrelated individuals, unconstrained by prior hypotheses
regarding genetic associations with disease. 4 However, the GWA approach can
also be problematic because the massive number of statistical tests performed
presents an unprecedented potential for false-positive results, leading to
new stringency in acceptable levels of statistical significance and
requirements for replication of findings. 5

The genome-wide, nonhypothesis-driven nature of GWA studies represents an
important step beyond candidate gene studies, in which the high cost of
genotyping had limited the number of variants assayed to several hundred at
most. This required careful selection of variants to be studied, often based
on imperfect understanding of the biologic pathways relating genes to
disease. 6 Many such associations failed to be replicated in subsequent
studies, 7,8 leading to calls for all genetic association reports to include
documented replication of findings as a prerequisite for publication. 9,10

For non-Mendelian conditions, GWA studies also represent a valuable advance
over family-based linkage studies, in which multiply affected families are
arduously assembled and inheritance patterns are related to several hundred
markers throughout the genome. Family-based linkage studies, although
successful in identifying genes of large effect in Mendelian diseases such as
cystic fibrosis and neurofibromatosis, have had more limited success in
common diseases like atherosclerosis and asthma. 11 Major limitations of
linkage studies are relatively low power for complex disorders influenced by
multiple genes, and the large size of the chromosomal regions shared among
family members (often comprising hundreds of genes), in whom it can be
difficult to narrow the linkage signal sufficiently to identify a

GWA studies build on the valuable 
lessons learned from candidate gene and 
family linkage studies, as well as the ex- 
panding knowledge of the relation- 
ships among SNP variants generated by 
the International HapMap Project, 12,13
to capture the great majority of com- 
mon genetic differences among indi- 
viduals and relate them to health and 
disease. These studies not only repre- 
sent a powerful new tool for identifi- 
cation of genes influencing common 
diseases, but also use new terminolo- 
gies (BOX 1), apply new models, and 
present new challenges in interpreta- 
tion. GWA studies rely on the "com- 
mon disease, common variant" hypoth- 
esis, which suggests that genetic 
influences on many common diseases 
will be at least partly attributable to a 
limited number of allelic variants 
present in more than 1% to 5% of the 
population. 14 Many important disease- 
causing variants may be rarer than this 
and are unlikely to be detected with this 
approach. 

Although GWA discovery studies 
provide important clues to genomic 
function and pathophysiologic mecha- 
nisms, they are as yet many steps re- 
moved from actual clinical applica- 
tion. Nonetheless, they have gained 
considerable media attention and have 
the potential for generating queries from 
patients about whether to get tested for 
the "new gene for disease X" based on 
the latest report. In this article, we de- 
scribe the design, interpretation, appli- 
cation, and limitations of GWA stud- 
ies for clinicians and scientists for whom 
this evolving science may have great rel- 
evance. 

Overview of GWA Studies 

A GWA study is defined by the 
National Institutes of Health as a 
study of common genetic variation 
across the entire human genome 
designed to identify genetic associa- 
tions with observable traits. 15 
Although family linkage studies and 
studies comprising tens of thousands 
of gene-based SNPs also assay genetic 
variation across the genome, 16 the 
National Institutes of Health defini- 
tion requires sufficient density and 
selection of genetic markers to cap- 
ture a large proportion of the com- 
mon variants in the study population, 
measured in enough individuals to 
provide sufficient power to detect 
variants of modest effect. 

The present discussion focuses on 
studies attempting to assay at least 
100 000 SNPs selected to serve as prox- 
ies for the largest possible number of 
SNPs. 12 The typical GWA study has 4 
parts: (1) selection of a large number 
of individuals with the disease or trait 
of interest and a suitable comparison 
group; (2) DNA isolation, genotyp- 
ing, and data review to ensure high 
genotyping quality; (3) statistical tests 
for associations between the SNPs pass- 
ing quality thresholds and the disease/ 
trait; and (4) replication of identified 
associations in an independent popu- 
lation sample or examination of func- 
tional implications experimentally. 
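
Step (3) above, the statistical test for association, can be sketched for a single SNP in a case-control design. The example below is a minimal illustration using a Pearson chi-square test on allele counts; the counts and the function name are hypothetical, and the article does not prescribe this particular test. The P-value shown is for one SNP only and is not adjusted for the very large number of SNPs tested in a genome-wide scan.

```python
from math import erfc, sqrt


def allelic_chi_square(case_variant, case_other, control_variant, control_other):
    """Pearson chi-square (1 df) on a 2x2 table of allele counts, with its P-value."""
    a, b, c, d = case_variant, case_other, control_variant, control_other
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, the upper tail of chi-square is erfc(sqrt(x / 2)).
    return chi2, erfc(sqrt(chi2 / 2.0))


# Hypothetical counts for one SNP: 1,000 cases and 1,000 controls (2,000 alleles each).
chi2, p_value = allelic_chi_square(300, 1700, 240, 1760)
print(f"chi-square = {chi2:.2f}, P = {p_value:.4f}")
```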

Most of the roughly 100 GWA stud- 
ies published by the end of 2007 were 
designed to identify SNPs associated 
with common diseases. However, the 
technique can also be used to identify 
genetic variants related to quantita- 
tive traits such as height 17 or electro- 
cardiographic QT interval, 18 and to rank 
the relative importance of previously 
identified susceptibility genes, such as 
APOE*e4 in Alzheimer disease 19 and 
CARD15 and IL23R in Crohn dis- 
ease. 20 

GWA studies can also demonstrate 
gene-gene interactions, or modifica- 
tion of the association of one genetic 
variant by another, as with GAB2 and 
APOE in Alzheimer disease, 21 and can 
detect high-risk haplotypes or combi- 
nations of multiple SNPs within a single 
gene, as in exfoliation glaucoma 22 and 
atrial fibrillation. 23 These studies have 
also been used to identify SNPs asso- 
ciated with gene expression, either as 
confirmation of a phenotypic associa- 
tion, such as asthma and ORMDL3 ex- 
pression, 24 or more globally. 25 Thus, 
GWA studies have broader applica- 
tions than those solely involving dis- 
covery of individual SNPs associated 
with discrete disease end points. 

Study Designs Used in GWA 

By far the most frequently used GWA 
study design to date has been the case- 
control design, in which allele frequen- 
cies in patients with the disease of in- 
terest are compared to those in a 
disease-free comparison group. These 
studies are often easier and less expen- 
sive to conduct than studies using other 
designs, especially if sufficient num- 
bers of case and control participants can 
be assembled rapidly. This design also 
carries the most assumptions, which if 
not met, can lead to substantial biases 
and spurious associations (TABLE 1). 
The most important of these biases in- 
volve the selected, often unrepresen- 
tative nature of the study case partici- 
pants, who are typically sampled from 
clinical sources and thus may not in- 
clude fatal, mild, or silent cases not 
coming to clinical attention; and the 
lack of comparability of case and con- 
trol participants, who may differ in im- 
portant ways that could be related both 
to genetic risk factors and to disease 
outcomes. 26 

If well-established principles of epi- 
demiologic design are followed, case- 
control studies can produce valid re- 
sults that, especially for rare diseases, 
may not be obtainable in any other way. 
However, genetic association studies 
using case-control methodologies have 
not always adhered to these prin- 
ciples. The often sharply abbreviated de- 
scriptions of case and control partici- 
pants and lack of comparison of key 
characteristics in GWA reports 27 can 
make evaluation of potential biases and 
replication of findings quite difficult. 28 

Box 1. Terms Frequently Used in Genome-wide Association Studies

Alleles
Alternate forms of a gene or chromosomal locus that differ [...]

Genome-wide association study
Any study of genetic variation across the entire human genome [...]
observable traits or the presence or absence of a disease, usually referring
to studies with genetic marker density of 100 000 or more to represent a
large proportion of variation in the human genome

Genotyping call rate
Proportion of samples or SNPs for which a specific allele SNP can be reliably
identified by a genotyping method

Haplotype
A group of specific alleles at neighboring genes or markers that tend to be
inherited together

HapMap
Genome-wide database of patterns of common human gen[...] variation among
multiple ancestral population samples

Hardy-Weinberg equilibrium
Population distribution of 2 alleles (with frequencies p and q) such that the
distribution is stable from generation to generation and genotypes occur at
frequencies of p^2, 2pq, and q^2 for the major allele homozygote,
heterozygote, and minor allele homozygote, respectively

Linkage disequilibrium
Association between 2 alleles located near each other on a chromosome, such
that they are inherited together more frequently than expected by chance

Mendelian disease
Condition caused almost entirely by a single major gene, such as cystic
fibrosis or Huntington's disease, [...] in only 1 or 2 of the 3 possible
genotype groups

Minor allele
[...] polymorphism that is less frequent in the [...]

Minor allele frequency
Proportion of the less common of 2 alleles in a population (with 2 alleles
carried by each person at each autosomal locus) ranging from less than 1% to
less than 50%

Modest effect
Association between a gene variant and disease or trait that is statistically
significant but carries a small odds ratio (usually <1.5)

Non-Mendelian disease (also "common" or "complex" disease)
Condition influenced by multiple genes and environmental factors and not
showing Mendelian inheritance patterns

Polymorphic
A gene or site with multiple allelic forms. The term polymorphism usually
implies a minor allele frequency of at least 1%

Population attributable risk
Proportion of a disease or trait in the population that is due to a specific
cause, such as a genetic variant

Population stratification (also "population structure")
[...] genetic differences between cases and controls unrelated to disease but
due to sampling them from populations of different ancestries

Abbreviation: SNP, single-nucleotide polymorphism.

The trio design includes the affected case participant and both of his or her
parents. 29 Phenotypic assessment (classification of affected status) is
performed only in the offspring and only affected offspring are included, but
genotyping is performed in all 3 trio members. The frequency with which an
allele is transmitted to an affected offspring from heterozygous parents is
then estimated. 29 Under the null hypothesis of no association with disease,
the transmission frequency for each allele of a given SNP will be 50%, but
alleles associated with the disease will be transmitted in excess to the
affected case individual. Because the trio design studies allele transmission
from parents to offspring, it is not susceptible to population
stratification, or genetic differences between case and control participants
unrelated to disease but due to sampling them from populations of different
ancestry. 30 A significant challenge of the trio design in GWA studies is its
sensitivity to even small degrees of genotyping error, 4,31 which can distort
transmission proportions between parents and offspring, especially for
uncommon alleles. Therefore, standards for genotyping quality in trio studies
may need to be more stringent than for other designs.
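
The 50% transmission null hypothesis described for the trio design can be tested with a simple statistic. The sketch below uses the McNemar-type statistic commonly known as the transmission disequilibrium test (TDT); the article does not name a specific test, so this choice, the function name, and the counts are assumptions made for illustration.

```python
from math import erfc, sqrt


def transmission_test(transmitted, untransmitted):
    """McNemar-type statistic for transmissions from heterozygous parents.

    transmitted / untransmitted: number of times the allele of interest was or
    was not passed to an affected offspring.
    """
    total = transmitted + untransmitted
    chi2 = (transmitted - untransmitted) ** 2 / total
    # Upper tail of chi-square with 1 degree of freedom.
    return transmitted / total, chi2, erfc(sqrt(chi2 / 2.0))


# Hypothetical counts from heterozygous parents in a set of trios.
rate, chi2, p_value = transmission_test(transmitted=130, untransmitted=100)
print(f"transmission rate = {rate:.3f}, chi-square = {chi2:.2f}, P = {p_value:.3f}")
```

Because only transmissions from heterozygous parents are counted, a genotyping error that misclassifies a parent changes both counts, which is one way to see why the design is sensitive to such errors.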

Table 1. Study Designs Used in Genome-wide Association Studies

Case-Control
  Assumptions
    [...] of all cases of the [...] diagnostic specificity
      and representativeness are clearly specified
    Genomic and epidemiologic data are collected similarly in cases
      and [controls]
    Differences in allele frequencies relate to the outcome of interest
      rather than differences in background population between cases
      and controls
  Advantages
    Short time frame
    Large numbers of case and control participants can be assembled
    Optimal epidemiologic design for studying rare diseases
  Disadvantages
    Prone to a number of biases including population stratification
    Cases are usually prevalent cases, may exclude fatal or short
      episodes [...]
    Overestimate relative risk for common diseases

Cohort
  Assumptions
    [...]
    Diseases and traits are ascertained similarly in individuals with
      and without the gene variant
  Advantages
    [...]
    Fewer biases than case-control studies
    Continuum of health-related measures available in population
      [...] selected for [...]
  Disadvantages
    Large sample size needed for genotyping if incidence is low
    Expensive and lengthy follow-up
    Existing consent may be [...] GWA genotyping or data sharing
    Requires variation in trait being [...]
    Poorly suited for studying rare diseases

Trio
  Assumptions
    Disease-related alleles are transmitted in excess of 50% [...]
      from heterozygous parents
  Advantages
    Controls for population structure; immune to population [...]
    Allows checks for Mendelian inheritance patterns in genotyping
      quality control
    [...] studies of children's conditions
    [...] of parents
  Disadvantages
    May be difficult to assemble both parents and offspring,
      especially in [...] with older ages of onset
    Highly sensitive to genotyping error

Cohort studies involve collecting
extensive baseline information in a
large number of individuals who are
then observed to assess the incidence
of disease in subgroups defined by
genetic variants. Although cohort
studies are typically more expensive 
and take longer to conduct than case- 
control studies, they often include 
study participants who are more rep- 
resentative than clinical series of the 
population from which they are 
drawn, and they typically include a 
vast array of health-related character- 
istics and exposures for which genetic 
associations can be sought. 17,18 For
these reasons, genome-wide genotyp- 
ing has recently been added to cohort 
studies such as the Framingham Heart 
Study 52 and the Women's Health 
Study." 

Many GWA studies use multistage de- 
signs to reduce the number of false- 
positive results while minimizing the 
number of costly genome-wide scans 
performed and retaining statistical 
power. 4 Genome-wide scans are typi- 
cally performed on an initial group of 
case and control participants and then 
a smaller number of associated SNPs is 
replicated in a second or third group of 
case and control participants (Table 2). 
Some studies begin with small num- 
bers of participants in the initial scan but 
carry forward large numbers of SNPs to
minimize false-negative results. 34 Other
studies begin with more participants but 
carry forward a smaller proportion of as- 
sociated SNPs. 35 Optimal proportions of 
study participants and SNPs in each 
phase have yet to be determined, 36 but 
carrying forward a small proportion 
(<5%) of stage 1 SNPs will often mean 
limiting the associations ultimately iden- 
tified to those having a relatively large 
effect. 37 

Selection of Study Participants 

Many genetic studies, whether GWA 
or otherwise, focus on case partici- 
pants more likely to have a genetic 
basis for their disease, such as early- 
onset cases or those with multiple 
affected relatives. Misclassification of 
case participants can markedly reduce 
study power and bias study results 
toward no association, particularly 
when large numbers of unaffected 
individuals are misclassified as 
affected. For diseases that are difficult 
to diagnose reliably, ensuring that
cases are truly affected (as by invasive 
testing or imaging), is probably more 
important than ensuring generaliz- 
ability, although the limitations on
diagnostic reliability and generaliz-
ability should be clearly described so 
that clinicians can judge the relevance 
to their patients. 

The control participants should be 
drawn from the same population as the 
case participants and should be at risk 
to develop the disease and be detected 
in the study. Inclusion of women as con- 
trols in genetic association studies of dis- 
eases limited to men, for example, is 
problematic in that this approach adds 
individuals to the control group who had 
no chance of developing the disease (but 
might have done so had they also in- 
herited a Y chromosome), thus mixing 
the controls with possible latent cases. 
This artificially reduces the differences 
in allele frequencies between cases and 
controls and limits the ability of the 
study to detect a true difference (ie, re- 
duces study power). 

If the disease is common, such as 
coronary heart disease or hyperten- 
sion in the United States, efforts should 
be made to ensure that the controls are 
truly disease free. Some studies ad- 
dress this by using super-controls or 
persons at high risk but without even 
early evidence of disease, such as persons
with diabetes of long duration but
without microalbuminuria in a study 
of diabetic nephropathy. 38 The suc- 
cess of recent GWA studies using con- 
trol groups of questionable represen- 
tativeness due to volunteer bias, such 
as the blood donor cohort in the Well- 
come Trust Case-Control Consortium, 19 
suggests that initial identification of 
SNPs associated with disease may be ro- 
bust to these biases, especially given 
subsequent evidence of replication of 
these associations in studies using more 
traditional control groups. 40-42

Of more concern may be the risk of 
false-negative findings, as many biases 
tend to reduce the magnitude of ob- 
served associations toward the null. Use 
of convenience controls such as blood 
donors, however, may also be problem- 
atic in examining potential modifica- 
tion of genetic associations by environ- 
mental exposures and sociocultural 
factors, and in the identification of less 
strongly associated SNPs. 

A key component of articles reporting
results of observational studies in the
epidemiology literature is an initial
table comparing relevant characteristics
of those with and without disease, allowing
assessment of comparability and
generalizability of the 2 groups. Such
comparisons are infrequent in GWA 
studies, 28 but they are important because 
common diseases are typically influenced 
by multiple environmental (as well as ge- 
netic) factors. Important differences 
should be adjusted for in the analysis if 
possible, to avoid the risk of identifying 
genetic associations not with the disease 
of interest but with a confounding factor, 
such as smoking 43 or obesity. 44 

Confounding due to population strati- 
fication (also called population structure) 
has been cited as a major threat to the va- 
lidity of genetic association studies, but 
its true importance is a matter of debate. 45 - 46 
When variations occur in allele frequency 
between population subgroups, such as 
those defined by ethnicity or geographic 
origin, that in turn differ in their risk for 
disease, GWA studies may then falsely 
identify the subgroup-associated genes as 
related to disease. 30 Population structure 
should be assessed and reported in GWA
studies, typically by examining the distribution
of test statistics generated from the thousands
of association tests performed (eg, the χ² test)
and assessing their deviation from the null
distribution (that expected under the null
hypothesis of no SNP associated with the trait)
in a quantile-quantile or "Q-Q" plot (Figure 1).
In these plots, observed association statistics
or calculated P values for each SNP are ranked
in order from smallest to largest and plotted
against the values expected had they been
sampled from a distribution of known form (such
as the χ² distribution). 39 Deviations from the
diagonal identity line suggest that either the
assumed distribution is incorrect or that the
sample contains values arising in some other
manner, as by a true association. 39

Table 2. Examples of Multistage Designs in Genome-wide Association Studies a

             3-Stage Study b                  4-Stage Study c
         Case/Control     SNPs            Case/Control     SNPs
Stage    Participants     Analyzed        Participants     Analyzed
1        400/400          500 000         2000/2000        100 000
2        4000/4000        25 000          2000/2000        1000
3        20 000/20 000    25              2000/2000        20
4                                         2000/2000        5

Abbreviation: SNP, single-nucleotide polymorphism.
a Based on hypothetical data.
b Five SNPs associated with disease.
c Two SNPs associated with disease.

Figure 1. Hypothetical Quantile-Quantile Plots in Genome-wide Association Studies

[Panel A: Before-and-after exclusion of [...]. Panel B: Before-and-after
adjustment for population [...]. x-axis: Expected χ².]

The Q-Q plot is used to assess the number and magnitude of observed associations
between genotyped single-nucleotide polymorphisms (SNPs) and the disease or trait
under study, compared to the association statistics expected under the null
hypothesis of no association. Observed association statistics (eg, χ² [...]
statistics) or -log10 P values calculated from them are ranked in order from
smallest to largest on the y-axis and plotted against the distribution that would
be expected under the null hypothesis of no association on the x-axis. Deviations
from the identity line suggest either that the assumed distribution is incorrect
or that the sample contains values arising in some other manner, as by a true
association. 39 A, Observed χ² statistics of all polymorphic SNPs (dark blue) in a
hypothetical genome-wide association study of a complex disease vs the expected
null distribution (black line). The sharp deviation above an expected χ² value of
approximately 8 could be due to a strong association of the disease with SNPs in a
heavily genotyped region such as the major histocompatibility locus (MHC) on
chromosome 6p21 in [...] rheumatoid arthritis. Exclusion of SNPs from such a locus
may leave a residual upward deviation (light blue) identifying more associated
SNPs with higher observed χ² values (exceeding approximately [...]) than expected
under the null hypothesis. B, Observed (dark purple) vs expected (black line) χ²
statistics for a hypothetical genome-wide association study of a complex disease.
Deviation from the expected distribution is observed above an expected χ² of
approximately 5. Inflation of observed statistics due to relatedness and potential
population structure can be estimated by the method of genomic control. 15
Correction for this inflation by simple division reduces the unadjusted χ²
statistics (dark purple) to the adjusted levels (light purple) showing deviation
only above an expected χ² of approximately 15. The region between expected χ² of
approximately 5 to approximately 15 is suggestive of broad differences in allele
frequencies that are more likely due to population structure than disease
susceptibility genes.

Figure 2. [Panel labels: IL23R, interleukin 23 receptor; centromeric]

Genome-wide association studies frequently identify associations with many [...]
polymorphisms (SNPs) in a chromosomal region, due in part to linkage
disequilibrium among the SNPs. This can make it difficult to determine which SNP
within a group is likely to be the causative or functional variant. A, Genomic
locations of 2 genes, the interleukin 23 receptor (IL23R) and the interleukin 12
receptor, beta-2 (IL12RB2), and a hypothetical protein, NM_001013674, between
positions 62700000 and 67580000 of the short arm of chromosome 1 at region 1p31,
are shown. B, The -log10 P values for association with inflammatory bowel disease
are plotted for each SNP genotyped in the region; those reaching a prespecified
value of -log10 of 7 or greater are presumed to [...] telomeric of position
approximately 67400000 and extending just centromeric of position approximately
67450000. C, Pairwise linkage disequilibrium estimates between SNPs (measured as
r²) are plotted for the region. Higher r² values are indicated by darker shading.
The region contains 4 "triangles" or "blocks" of linkage disequilibrium, 2 on
either side of position 67400000 in the IL23R gene, another in the hypothetical
protein telomeric of IL23R, and a fourth in the IL12RB2 gene at the centromeric
end of the region. The 2 IL23R linkage disequilibrium regions each contain SNPs
associated with inflammatory bowel disease, while the IL12RB2 region does not.
Reproduced with permission from Duerr et al. 53

Since the underlying assumption in 
GWA studies is that the vast majority of 
assayed SNPs are not associated with the 
trait, strong deviations from the null sug- 
gest either a very highly associated and 
heavily genotyped locus (Figure 1, A), or
significant differences in population structure
(Figure 1, B). Several effective statistical
methods are available to correct for
population structure and are a standard
component of rigorous GWA analyses. 28,50
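
The genomic control correction named in the Figure 1 legend (estimate the inflation, then divide) can be sketched as follows. This is an illustration under stated assumptions, not the procedure of any particular study: it assumes 1-df χ² statistics, estimates the inflation factor from the median statistic (a common choice, assumed here), and the function names are hypothetical.

```python
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def inflation_factor(chi2_stats):
    """Estimate the inflation factor (lambda) from 1-df chi-square statistics."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN


def genomic_control_adjust(chi2_stats):
    """Divide every statistic by lambda; a lambda below 1 is treated as 1."""
    lam = max(inflation_factor(chi2_stats), 1.0)
    return [x / lam for x in chi2_stats]


# Hypothetical statistics whose median is twice the null median.
observed = [0.1, 0.5, 2 * CHI2_1DF_MEDIAN, 2.0, 10.0]
print(inflation_factor(observed))        # close to 2
print(genomic_control_adjust(observed))  # every value roughly halved
```

Using the median rather than the mean keeps a handful of truly associated SNPs from dominating the estimate of inflation.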

Genotyping and Quality Control 
in GWA Studies 

GWA studies rely on the typically strong 
associations among SNPs located near 
each other on a chromosome, which tend
to be inherited together more often than
expected by chance. 50 This nonrandom 
association is called linkage disequilib- 
rium; alleles of SNPs in high linkage dis- 
equilibrium are almost always inher- 
ited together and can serve as proxies for 
each other. Their correlation with each 
other in the population is measured by 
the r² statistic, which is the proportion
of variation of one SNP explained by the 
other, and ranges from 0 (no associa- 
tion) to 1 (perfect correlation). 
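
As a rough illustration of the r² statistic described above, the sketch below computes it for two biallelic SNPs from haplotype and allele frequencies. The function name and the example frequencies are hypothetical, and the calculation assumes phased haplotype frequencies are already known.

```python
def r_squared(p_ab, p_a, p_b):
    """Squared correlation between alleles at two biallelic SNPs.

    p_ab: frequency of the haplotype carrying allele A at SNP 1 and B at SNP 2
    p_a, p_b: frequencies of allele A (SNP 1) and allele B (SNP 2)
    """
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic")
    return d * d / denom


print(r_squared(0.5, 0.5, 0.5))   # alleles always travel together
print(r_squared(0.25, 0.5, 0.5))  # alleles are independent
print(r_squared(0.4, 0.5, 0.5))   # intermediate correlation
```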

Genomic coverage of GWA genotyp- 
ing platforms (arrays or chips on which 
genotyping is performed) is often esti- 
mated by the percent of common SNPs 
having an r² of 0.8 or greater with at
least 1 SNP on the platform. 13 Geno- 
typing platforms comprising 500 000 to 



1 000 000 SNPs have been estimated to 
capture 67% to 89% of common SNP 
variation in populations of European 
and Asian ancestry and 46% to 66% of 
variation in individuals of recent Afri- 
can ancestry. 13 Higher density plat- 
forms now also include probes for copy 
number variants that are not well tagged 
by SNPs. Copy number variants, in 
which stretches of genomic sequence 
are deleted or are duplicated in vary- 
ing numbers, have gained increasing at- 
tention because of their apparent ubi- 
quity and potential dosage effect on 
gene expression. 51 Newer genotyping 
platforms are increasingly being fo- 
cused on capturing copy number vari- 
ants, but other structural variants such 
as insertions, deletions, and inver- 
sions, remain difficult to assay. 52 

GWA studies frequently identify as- 
sociations with multiple SNPs in a chro- 
mosomal region and display the asso- 
ciation statistics by their genomic 
location on a portion of a chromosome 
(Figure 2). For ease of display, association
statistics are typically shown as
the -log10 of the P value (the probability
of the observed association arising
by chance alone), so that P=.01 would
be plotted as "2" on the y-axis and
P=10⁻⁷ as "7." Such displays also often
plot a matrix of r² values for each
pair of SNPs in the region, with larger
r² values more intensely shaded. These
plots can be used to identify linkage dis- 
equilibrium blocks containing SNPs as- 
sociated with disease, allowing estima- 
tion of the independence of the SNP 
associations observed. 53 
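
The -log10 transformation used for these displays is simple enough to show directly; the helper name below is hypothetical.

```python
import math


def neg_log10(p_value):
    """Height at which a P value is plotted on a -log10 scale."""
    return -math.log10(p_value)


for p in (0.01, 1e-7):
    print(p, round(neg_log10(p), 6))
```

Smaller P values map to taller points, which is why the strongest associations stand out as peaks in regional and genome-wide plots.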

Table 3. Association of Alleles and Genotypes of rs6983267 on Chromosome 8q24 With Colorectal Cancer a

Number and Frequency (%) of rs6983267 Alleles in Colorectal Cancer
            C             T             χ² (1 df)   P Value       OR
Cases       875 (56.5)    675 (43.5)    24.8        6.3 × 10⁻⁷    1.35 b
Controls    1860 (48.9)   1940 (51.1)

Number and Frequency (%) of rs6983267 Genotypes in Colorectal Cancer
            CC           CT           TT           χ² (2 df)   P Value       OR       OR
Cases       250 (32.3)   375 (48.4)   150 (19.4)   24.5        4.7 × 10⁻⁶    1.33 c   1.81 d
Controls    460 (24.2)   940 (49.4)   500 (26.3)

Abbreviation: OR, odds ratio.
a Data are hypothetical; adapted from Tomlinson et al. 56
b Denotes allelic odds ratio.
c Denotes heterozygote odds ratio.
d Denotes homozygote odds ratio.

Genotyping errors, especially if occurring
differentially between cases and controls,
are an important cause of spurious
associations and must be diligently sought
and corrected. 54 A number of quality control
features should be applied both on a
per-sample and a per-SNP basis. Checks on
sample identity to avoid sample mix-ups
should be described and a minimum rate of
successfully genotyped SNPs per sample
(usually 80%-90% of SNPs attempted) should
be reported. Once samples failing these
thresholds are removed, individual SNPs
across the remaining samples are subjected
to further checks or filters for probable
genotyping errors, including: (1) the
proportion of samples for which a SNP
can be measured (the SNP call rate,
typically >95%); (2) the minor allele
frequency (often >1%, as rarer SNPs
are difficult to measure reliably); (3) severe
violations of Hardy-Weinberg equilibrium;
(4) Mendelian inheritance errors in trio
studies; and (5) concordance rates in
duplicate samples (typically >99.5%).
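
The per-SNP filters listed above can be sketched for a single SNP as below. The call-rate and minor-allele-frequency cutoffs follow the typical values quoted in the text; the Hardy-Weinberg P value cutoff of 1e-6 and all function names are assumptions made for illustration, since the text says only that "severe" violations are filtered.

```python
import math


def hwe_chi2(n_aa, n_ab, n_bb):
    """1-df chi-square comparing genotype counts with Hardy-Weinberg expectations."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


def chi2_1df_pvalue(x):
    """Upper-tail probability of a 1-df chi-square statistic."""
    return math.erfc(math.sqrt(x / 2))


def passes_snp_qc(n_aa, n_ab, n_bb, n_samples,
                  min_call_rate=0.95, min_maf=0.01, min_hwe_p=1e-6):
    """Apply call-rate, minor-allele-frequency and Hardy-Weinberg filters to one SNP."""
    called = n_aa + n_ab + n_bb
    if called == 0:
        return False
    call_rate = called / n_samples
    p = (2 * n_aa + n_ab) / (2 * called)
    maf = min(p, 1 - p)
    hwe_p = chi2_1df_pvalue(hwe_chi2(n_aa, n_ab, n_bb))
    return call_rate > min_call_rate and maf > min_maf and hwe_p >= min_hwe_p


print(passes_snp_qc(250, 500, 250, 1000))  # well-behaved SNP
print(passes_snp_qc(500, 0, 500, 1000))    # no heterozygotes at all
```

Mendelian-error and duplicate-concordance checks need family or replicate data and are left out of this sketch.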

Additional checks on genotyping 
quality should include careful visual in- 
spection of genotype cluster plots, or 
intensity values generated by the geno- 
typing assay to ensure that the stron- 
gest associations do not merely reflect 
genotyping artifact. 28,39 Genotyping the 
most strongly associated SNPs should 
also be confirmed using a different 
method. 28 Associations with any known 
"positive controls," such as TCF7L2 in 
type 2 diabetes mellitus 55 or HLA- 
DRB1 in rheumatoid arthritis, 47 should 
be reported to increase confidence in 
the consistency of findings with prior 
reports. 

Analysis and Presentation
of GWA Results

Figure 3. Genome-wide Association Findings in Rheumatoid Arthritis

Genome-wide association studies assume [no] a priori hypotheses about candidate
genes or regions that might be associated with disease; rather, they test
single-nucleotide polymorphisms (SNPs) throughout the genome for possible evidence
of genetic susceptibility. Associations plotted as -log10 P values for a
genome-wide association study in 1522 cases with rheumatoid arthritis and 1850
controls, showing single data points for SNPs [with] P < 10⁻⁴ (lower horizontal
red line) for 22 autosomes and the X chromosome. The predefined level of
significance, at 5 × 10⁻⁸, is shown with a horizontal blue line. SNPs at PTPN22 on
chromosome 1, the major histocompatibility complex (MHC) on chromosome 6, and the
TRAF1-C5 locus on chromosome 9 exceed this threshold. Reproduced with permission
from Plenge et al.

Associations with the 2 alleles of each
SNP are tested in a relatively straightforward
manner by comparing the frequency
of each allele in cases and controls
(Table 3). Because each individual
carries 2 copies of each autosomal SNP,
the frequency of each of 3 possible
genotypes can also be compared
(Table 3). Exploratory analyses may
also include testing of different genetic
models (dominant, recessive, or additive),
although additive models, in
which each copy of the allele is assumed
to increase risk by the same
amount, tend to be the most common
(T.A.M.; unpublished data, 2008). Odds
ratios of disease associated with the risk
allele or genotype(s) can then be calculated
and are typically modest, often
in the range of 1.2 to 1.3. Many
studies also calculate population attributable
risk, classically defined as the
proportion of disease in the population
associated with a given risk factor
(in this case, a genetic variant). 57
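
The allele and genotype comparisons summarized in Table 3 can be reproduced with a few lines of arithmetic. The sketch below uses the hypothetical counts from that table; the function names are illustrative, and the P value relies on the standard closed form for the upper tail of a 1-df χ² statistic.

```python
import math


def allelic_test(case_a, case_b, ctrl_a, ctrl_b):
    """Return (odds ratio, 1-df chi-square, P value) for a 2x2 allele-count table."""
    n = case_a + case_b + ctrl_a + ctrl_b
    cross_diff = case_a * ctrl_b - case_b * ctrl_a
    odds_ratio = (case_a * ctrl_b) / (case_b * ctrl_a)
    chi2 = n * cross_diff ** 2 / (
        (case_a + case_b) * (ctrl_a + ctrl_b)
        * (case_a + ctrl_a) * (case_b + ctrl_b)
    )
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail, 1 df
    return odds_ratio, chi2, p_value


def genotype_odds_ratio(case_risk, case_ref, ctrl_risk, ctrl_ref):
    """Odds ratio for one genotype against the reference genotype."""
    return (case_risk * ctrl_ref) / (case_ref * ctrl_risk)


# Allele counts from Table 3: C and T alleles in cases, then in controls.
odds_ratio, chi2, p_value = allelic_test(875, 675, 1860, 1940)
print(round(odds_ratio, 2), round(chi2, 1), p_value)

# Genotype counts from Table 3, with TT as the reference genotype.
print(round(genotype_odds_ratio(375, 150, 940, 500), 2))  # CT vs TT
print(round(genotype_odds_ratio(250, 150, 460, 500), 2))  # CC vs TT
```

The rounded results agree with the allelic, heterozygote, and homozygote odds ratios and the 1-df χ² value reported in Table 3.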

Such estimates are nearly always in- 
flated because odds ratios overesti- 
mate relative risks (especially for com- 
mon diseases 58 ) needed for population 
attributable risk calculations, and be- 
cause odds ratios and allele frequen- 
cies in published reports have wide con- 



fidence intervals so that those selected 
by exceeding a specified threshold for 
statistical significance tend to be bi- 
ased upwards, an effect of ascertain- 
ment known as the "winner's curse." 5,5 
This exaggerated initial estimate of the 
odds ratio often leads to replication 
studies that lack sufficient sample size 
and power to replicate the association 
because larger samples are needed to de- 
tect smaller odds ratios. 
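
To see why an inflated risk estimate inflates population attributable risk, one can use Levin's classical formula, PAR = p(RR - 1) / (1 + p(RR - 1)), where p is the prevalence of the risk factor and RR the relative risk. The formula choice, the function name, and the input values below are assumptions for illustration and are not taken from any study cited here.

```python
def population_attributable_risk(prevalence, relative_risk):
    """Levin's formula for the proportion of disease attributable to a risk factor."""
    excess = prevalence * (relative_risk - 1)
    return excess / (1 + excess)


# Hypothetical risk factor carried by half the population.
true_rr = 1.2        # assumed true relative risk
inflated_rr = 1.35   # assumed overestimate, eg, an odds ratio from a first report
print(population_attributable_risk(0.5, true_rr))
print(population_attributable_risk(0.5, inflated_rr))
```

With the same prevalence, substituting the larger estimate raises the attributable fraction, which is the direction of bias described above.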

Box 2. Ten Basic Questions to Ask About a Genome-wide Association Study Report a

1. Are the cases defined clearly and reliably so that they can be compared with
patients typically seen in clinical practice?

2. Are case and control participants demonstrated to be comparable to each other
on important characteristics that might also be related to genetic variation and to
[disease]?

3. Was the study of sufficient size to detect modest odds ratios or relative risks
(1.3-1.5)?

4. Was the genotyping platform of sufficient density to capture a large proportion
of the variation in the population studied?

5. Were appropriate quality control measures applied to genotyping assays,
including visual inspection of cluster plots and replication on an independent
genotyping platform?

6. Did the study reliably detect associations with previously reported and
replicated variants (known positives)?

7. Were stringent corrections applied for the many thousands of statistical tests
performed in defining the P value for significant associations?

8. Were the results replicated in independent population samples?

9. Were the replication samples comparable in geographic origin and phenotype
definition, and if not, did the differences extend the applicability of the findings?

10. Was evidence provided for a functional role for the gene polymorphism
identified?

a For a more detailed description of interpretation of genome-wide association
studies, see NCI/NHGRI Working Group on Replication in Association Studies.

Complexity in analysis emerges due
to the multiple testing carried out in
GWA studies, in that the association
tests shown in Table 3 are repeated for
each of the 100 000 to more than 1 million
SNPs assayed (Figure 3). At the
conventional P<.05 level of significance,
an association study of 1 million
SNPs will show 50 000 SNPs to be
"associated" with disease, almost all
falsely positive and due to chance alone.
The most common manner of dealing
with this problem is to reduce the
false-positive rate by applying the Bonferroni
correction, in which the conventional
P value is divided by the number
of tests performed. 60 A 1 million SNP
survey would thus use a threshold of
P<.05/10⁶, or 5 × 10⁻⁸, to identify
associations unlikely to have occurred by
chance. This correction has been criticized
as overly conservative because it
assumes independent associations of
each SNP with disease even though
individual SNPs are known to be correlated
to some degree due to linkage
disequilibrium.
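
The arithmetic behind these figures is short enough to check directly. The helper names below are hypothetical; the numbers are the ones used in the text.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Tests expected to reach P < alpha by chance when no SNP is truly associated."""
    return n_tests * alpha


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


n_snps = 1_000_000
print(expected_false_positives(n_snps))  # about 50 000 chance findings
print(bonferroni_threshold(n_snps))      # about 5e-8
```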

Other approaches have been proposed,
including estimation of the false
discovery rate or proportion of significant
associations that are actually false
positive associations; false-positive report
probability, or probability that the
null hypothesis is true given a statistically
significant finding, 63 and estimation
of Bayes factors that incorporate the
prior probability of association based on
characteristics of the disease or the specific
SNP. 39 To date, Bonferroni correction
has generally been the most commonly
used correction for multiple
comparisons in GWA reports (T.A.M.;
unpublished data, 2008).
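
One widely used way to control the false discovery rate is the Benjamini-Hochberg step-up procedure. The text does not say which procedure the cited reports used, so the sketch below is an assumption chosen for illustration, with hypothetical P values.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return indices of tests declared significant at false discovery rate q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    # Find the largest rank k whose P value is at most k * q / m.
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p_values))                      # FDR-significant tests
print([i for i, p in enumerate(p_values) if p < 0.05 / len(p_values)])  # Bonferroni
```

In this toy example the false discovery rate approach retains one more test than Bonferroni correction, illustrating why it is regarded as less conservative.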

Replication and Functional 
Studies 

Given the major challenge of separating
the many false-positive associations
from the few true-positive associations
with disease in GWA studies, an important
strategy has been replication of results
in independent samples. 28 This is
typically included in a single GWA report
as part of a multistage design 34,35 or
may be reported separately. Consensus
criteria for replication have recently
been published and include study of the
same or very similar phenotype and
population, and demonstration of a similar
magnitude of effect and significance
(in the same genetic model and same
direction) for the same SNP and the same
allele as the initial report. 28 Replication
is usually first attempted in studies as
similar as possible to the initial report,
but then may be extended to related
phenotypes (such as fat mass in addition to
obesity 44), different populations (such as
West Africans in addition to Icelanders 65),
or different study designs 53 to refine
and extend the initial findings and
increase confidence in verity.

Lack of reproducibility of genetic associations
has been frequently observed
and has been varyingly attributed
to population stratification,
phenotype differences, selection biases,
genotyping errors, and other factors. 28,66
At present, the best way of resolving
these inconsistencies appears to
be additional replication studies with 
larger sample sizes, although this may 
not be feasible for rare conditions or for 
associations identified in unique popu- 
lations. 28 

Identification of a robustly replicat- 
ing SNP-disease association is a cru- 
cial first step in identifying disease- 
causing genetic variants and developing 
suitable treatments, but it is only a first 
step. Association studies essentially 
identify a genomic location related to 
disease but provide little information 
on gene function unless SNPs with pre- 
dictable effects on gene expression or 
the transcribed product happened to be 
identified. Few of the associations iden- 
tified to date have involved genes pre- 
viously suspected of being related to the 
disease under study, and some have 
been in genomic locations harboring no
known genes. 27,67 Examination of
known SNPs in high linkage disequi- 
librium with the associated SNP may 
identify variants with plausible bio- 
logic effects, or sequencing of a suit- 
able surrounding interval may be un- 
dertaken to identify rarer variants with 
more obvious functional implica- 
tions. Tissue samples or cell lines can 
be examined for expression of the gene
variant. Other functional studies may 
include genetic manipulations in cell or 
animal models, such as knockouts or 
knock-ins. 68

Limitations of GWA Studies

The potential for false-positive results,
lack of information on gene function, in- 
sensitivity to rare variants and struc- 
tural variants, requirement for large 
sample sizes, and possible biases due to 
case and control selection and genotyp- 
ing errors, are important limitations of 
GWA studies. The often limited infor- 
mation available about environmental 
exposures and other non-genetic risk fac-
tors in GWA studies will make it diffi-
cult to identify gene-environment inter- 
actions or modification of gene-disease 
associations in the presence of environ- 
mental factors. Clinicians and scien- 
tists should understand the unique as- 
pects of these studies and be able to assess 
and interpret GWA results for them- 
selves and their patients. Ten basic ques- 
tions to ask about GWA studies, many 
of which also apply generically to asso- 
ciation studies of nongenetic risk fac- 
tors, are outlined in Box 2. Most of these 
questions should be answered in the af- 
firmative for a reliable report; how- 
ever, many GWA reports lack suffi- 
cient detail to assess them. 28 

Many of the design and analysis features
of GWA studies [...] the false-[...]
to identify [...]. These efforts
to reduce false positive results, however,
may result in overlooking a true
association, especially if only a small
number of SNPs are carried over from the
initial scan into replication studies. The
most robust findings, ie, those that "survive"
multiple rounds of replication, are
often not the most statistically significant
associations in the initial scan, and
may not even be in the top few hundred
associations. 70 Another cause of
false-negative results is the lack of the
genetic variant of relevance on the
genotyping platform, or lack of variation in
that SNP in the population under study.
As the number of SNPs and diversity of
populations represented on genotyping
platforms increase, this should become
less of a problem.

An important question generated by
these early GWA studies relates to the
small proportion of heritability, or familial
clustering, explained by the genetic
variants identified to date. Most of
these variants have very modest effects
on disease risk, increasing it by only 20%
to 50%, and explaining only a small fraction
of population risk or total estimated
heritability for most conditions. 71
Might the rest of the genetic
influence reside in a long "tail" of com- 
mon SNPs with very small odds 
ratios, in copy number variants or other 
structural variants, rarer variants of larger 
effect, or interactions among common 
variants? Or has familial clustering due 
to genetic factors been overestimated and 
important environmental influences, 
either acting alone or in combination 
with genetic variants, been overlooked? 
This remains to be determined, but it is 
important to realize that even small odds 
ratios or rare variants can suggest im- 
portant therapeutic strategies such as the 
development of HMG-CoA reductase in- 
hibitors arising from identification of 
LDL-receptor mutations in familial hy- 
percholesterolemia. 72 

Clinical Applications
of GWA Findings

Despite the considerable media atten- 
tion that GWA reports frequently re- 
ceive, these studies are clearly many 
steps removed from actual clinical ap- 
plication. The primary use for GWA 
studies for the foreseeable future is 
likely to be in investigation of biologic 
pathways of disease causation and nor- 
mal health and development. This is not 
to suggest that some early successes 
may not occur in the near future, 
through rapid development of treat- 
ment strategies such as inhibitors of 
complement activation in age-related 
macular degeneration. 7 - Use of GWA 
findings in screening for disease risk, 
while beginning to be marketed com- 
mercially, is more problematic. Al- 
though obtaining the latest "gene test" 
may be alluring to a technology- 
focused society, evidence is needed that 
such screening adds information to 
known risk factors (such as age, obe- 
sity, and family history for diabetes), 
that effective interventions are available,
that improved outcomes justify the



associated costs, and that obtaining this 
information does not have serious ad- 
verse consequences for patients and 
their families. Such evidence is likely 
to be some ways off, but the initial burst 
of discovery generated by GWA scans 
has now mandated a concerted effort 
to search for these answers.

conduct such studies are likely to assume that their work
falls under this category. Conversely, public health and health 
services researchers might be confused to see the term ap- 
plied to their work. 

Furthermore, using "clinical" in both terms perpetu- 
ates the tendency of the medical profession to view 
health research through the clinician's lens alone. Fiscella 
et al do include "organizational- and community- 
focused" research within their definition of applied clini- 
cal research, but labeling health interventions outside the 
clinic as "clinical" research may be a forced fit. Pros and 
cons exist with other potential terms such as knowledge 
translation — the term discussed by Dr Graham and Ms 
Tetroe— but all of them are an improvement over the 
ambiguity of T2. 

Graham and Tetroe call attention to the excellent work 
of the Canadian Institutes of Health Research. Canadian in- 
vestigators and institutions have played a leadership role not 
only in writing about the need for researchers to align their 
work with the information needs of end users 1 but also in 
making real commitments in programs and funding to fa- 
cilitate T2 as a nation. 2 The United States would do well to 
follow the Canadian example. 

T1 is among a group of clinical research movements
that are attracting attention and resources but are ulti- 
mately unhelpful to patients without T2. Recently, politi- 
cians and industry have announced plans to channel mil- 
lions of dollars per year into research on "comparative 
effectiveness" 3 and "personalized medicine" 4 while keep- 
ing funding for health services research threadbare. 5 
Popular research initiatives address worthy questions: 
whether a treatment can be produced (T1), whether it
improves health (evidence-based medicine), which treatment
is best (comparative effectiveness), and which is
best for an individual patient (personalized medicine).
But the answers remain academic if the patient cannot 
obtain or use the intervention. Overcoming such 
obstacles so that the products of research benefit all those 
in need is itself a crucial research priority. 
CORRECTIONS



Incorrect Data: In the Perspectives on Care at the Close of Life article titled "Man- 
aging an Acute Pain Crisis in a Patient With Advanced Cancer: 'This Is as Much of 
a Crisis as a Code' " published in the March 26, 2008, issue of JAMA (2008;299
[12]:1457-1467), an incorrect dose ratio appeared in Table 2. The hydromorphone-
to-methadone ratio for less than 330 mg/24 hours of hydromorphone that read 
"16:1" should have been "1.6:1." 

Incorrect Legend: In the Special Communication entitled "How to Interpret a Ge- 
nome-wide Association Study" published in the March 19, 2008, issue of JAMA 
(2008;299[11]:1335-1344), an integral word was omitted from the Figure 3 leg- 
end. The sentence that read, "Genome-wide association studies assume a priori 
hypotheses about candidate genes or regions that might be associated with dis- 
ease; rather, they test single-nucleotide polymorphisms (SNPs) throughout the ge- 
nome for possible evidence of genetic susceptibility" should have read, "Genome- 
wide association studies assume no a priori hypotheses about candidate genes or 
regions that might be associated with disease; rather, they test single-nucleotide 
polymorphisms (SNPs) throughout the genome for possible evidence of genetic 
susceptibility." 

Unreported Research Funding: In the Research Letter titled "Exhaled Carbon Mon-
oxide With Waterpipe Use in US Students," published in the January 2, 2008, is-
sue of JAMA (2008;299[1]:36-38), the Financial Disclosures should have in-
cluded the following: Dr Hammond reports that she has received research funding
for studies on environmental tobacco smoke from the National Institutes of Health 
and from the Flight Attendants Medical Research Institute. However, none of these 
grants were used to support the study reported in this Research Letter. 
Guilt by association



Anne M. Bowcock 



In a tour-de-force demonstration of feasibility, a consortium of 50 research teams uses 500,000 genetic 
markers from each of 17,000 individuals to identify 24 genetic risk factors for 7 common human diseases. 



Mr Woodhouse, the comical hypochondriac 
of Jane Austen's Emma, takes great comfort in
blaming his various ailments on the rain, the 
cold and an unfortunate piece of wedding cake. 
He would, no doubt, have been greatly sur- 
prised to learn that even his most rudimentary 
ailments resulted, at least in part, from genetic 
factors. Reporting on page 661 of this issue 1 , 
a consortium of more than 50 British groups, 
known collectively as the Wellcome Trust Case 
Control Consortium (WTCCC), asserts just 
that. In the largest study of its type so far, the 
WTCCC has examined the genetic under- 
pinnings of seven common human diseases: 
rheumatoid arthritis, hypertension, Crohn's 
disease (the most common form of inflam- 
matory bowel disease), coronary artery dis- 
ease, bipolar disorder — also known as manic 
depression — and type 1 and type 2 diabetes. 

The WTCCC study is groundbreaking 
in various respects. It not only confirms the 
involvement of some genes for which disease 
association has previously been reported, but 
it also identifies several novel genes that affect 
susceptibility to common diseases. Moreover, 
it models a successful and instructive approach 
to large-scale genomic scans of this type, show- 
ing that a set of common controls can be used 
for a variety of diseases with relatively little loss 
of analytical power. Its success also provides 
strong grounds for performing such studies on 
an even larger scale. 

The WTCCC investigators examined genetic 
variation at 500,000 different positions within 
the genomes of 17,000 individuals living in 
Britain using a genome-wide association scan 
(Fig. 1). This statistical approach compares the
frequencies of genetic variation in disease cases 
and in healthy controls from the same popula- 
tion. Using the signal from each position as an 
indicator for the DNA sequence that surrounds 
it, genome-wide association scans examine the 
relationship between each DNA position and 
a particular trait (such as diabetes). Strong 
'association' between a DNA position and a 
trait marks the general locale of the offending 
alteration, even if it is not itself the cause. 
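
As a rough illustration of the case-control comparison described above, the following Python sketch runs a 1-degree-of-freedom Pearson chi-square test on the allele counts at a single SNP. It is not the WTCCC's analysis pipeline; the function name `allelic_test` and the allele counts in the usage note are hypothetical, chosen only to match the study's scale of 2,000 cases and 3,000 controls.

```python
import math


def allelic_test(case_a, case_b, ctrl_a, ctrl_b):
    """Pearson chi-square test (1 df) on a 2x2 table of allele counts.

    case_a / ctrl_a: copies of the candidate risk allele in cases / controls.
    case_b / ctrl_b: copies of the other allele in cases / controls.
    Returns (chi-square statistic, p-value, odds ratio).
    """
    n = case_a + case_b + ctrl_a + ctrl_b
    row_totals = (case_a + case_b, ctrl_a + ctrl_b)
    col_totals = (case_a + ctrl_a, case_b + ctrl_b)
    observed = ((case_a, case_b), (ctrl_a, ctrl_b))

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if allele frequency were equal in both groups.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    odds_ratio = (case_a * ctrl_b) / (case_b * ctrl_a)
    return chi2, p_value, odds_ratio
```

With hypothetical counts of 1,400 risk alleles among 4,000 case chromosomes and 1,800 among 6,000 control chromosomes, `allelic_test(1400, 2600, 1800, 4200)` yields an odds ratio of about 1.26. A genome-wide scan repeats a test of this kind at every marker, which is why the significance threshold has to be far stricter than for a single test.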

The concept of drawing an association 
between biological traits and disease is hardly 
new 2 , but the scope and scale that the WTCCC 

Figure 1 | Genome-wide association scan. To identify genetic risk factors for common
diseases, the WTCCC researchers scanned DNA
from patients (2,000 per disease) and controls 
(3,000 shared for all seven diseases studied) 
for the frequency with which they contained 
each of the 500,000 genetic markers, or single 
nucleotide polymorphisms (SNPs), from the 
human genome. After statistical evaluation of 
the data, they found that most markers showed 
very little difference in the frequency of their 
two constituent forms — or alleles — between 
controls and cases. However, some SNPs 
occurred at a greater frequency in patients. Such 
alleles (one is shown in red) can be considered a 
genetic risk factor for a particular disease. 

attained in their application of this concept is 
unprecedented. Crucial to both the success 
of this study and keeping its cost reasonable 
were DNA from large numbers of unrelated 
patients; the availability of the complete DNA 
sequence of the human genome; the subsequent 
cataloguing of a large component of variation 
in the genome in the form of single nucleotide 
polymorphisms (SNPs) 3 ; the completion of the 
HapMap project 4 , which provided information 
on the statistical relatedness of SNPs; and the 
availability of high-throughput technologies 
that allowed for parallel typing of 500,000 
markers representing most of the common 
variation in the genome. 



For the seven diseases studied by the 
WTCCC, strong statistical evidence for asso- 
ciation was obtained for 12 previously identi- 
fied genomic regions and a similar number of 
new regions. Although this WTCCC report is 
based on initial studies, independent groups 5-9
have confirmed the involvement of all but one 
of these most significant regions through rep- 
lication studies. Some of the other identified 
regions with less statistically significant disease 
association are also likely to be true indicators 
of genetic risk; so these will need to be further 
evaluated in additional large sets of patients 
and controls. Indeed, because the WTCCC 
data will be publicly available, they will be a 
useful resource to other groups and consor- 
tia embarking on similar efforts to investigate 
genetic-association markers in these and other 
diseases. These researchers include members of 
the Genetic Association Information Network 10 
(GAIN), the Framingham Genetic Research 
Study and the Women's Health Study. 

With many of the genomic regions identified 
by the WTCCC, the next step will be to study 
the exact nature of the disease-causing variants, 
rather than the marker SNP with which each 
is associated. From this and previous studies, 
it seems that variations leading to common 
disease are diverse; some alter the coding 
sequences of genes, others lie within their non- 
coding sequences, and some are even located 
within gene deserts — regions of a chromo- 
some that contain no genes. So understanding 
the biological function of disease-risk-associ- 
ated genomic regions will be challenging. 

Two replication studies relating to the 
WTCCC findings are also published today 5,6 , 
revealing connections between the genomic 
regions associated with the risk of type 1 dia- 
betes and Crohn's disease and their underlying 
biology. Some of the known and newly identi- 
fied genetic risk factors for type 1 diabetes alter 
the development or function of immune cells, 
leading to aberrant recognition of pancreatic 
islet cells as foreign particles. But additional 
susceptibility genes identified recently 5 do not 
fit easily into this simple model. 

For Crohn's disease, one of the newly identi- 
fied 6 susceptibility genes is of particular inter- 
est because it is proposed to control the spread 
of intracellular pathogens by autophagy — the
process of cellular self-digestion. This is the 
second gene to be implicated in Crohn's dis- 
ease through involvement in autophagy; the 
first was identified earlier this year 11,12 . More-
over, an increasing body of evidence, including
the latest replication study 6 , points to defects in 
the early immune response and the handling of 
intracellular gut bacteria in the pathogenesis of 
Crohn's disease. 

The overall increase in risk (1.2-1.5 times) 
conferred by the genetic factors identified in 
the WTCCC study 1 is in agreement with those 
reported by others. However, these factors are 
unlikely to explain completely the clustering of 
any of these diseases in families, and there are 
other genes (possibly many of very small effect) 
— or rare variants of genes — that are still to be 
identified for these and other diseases. 

One unexpected result of the WTCCC study 
was the identification of 13 regions with pro- 
nounced geographical variation within Britain. 
Among these regions is a large cluster of genes 
that encodes the major histocompatibility 
complex, which is well known for its function 
in the immune response and autoimmune dis-
ease 13 , and a gene that is involved in lactase per-
sistence, or the ability to digest milk 14,15 . Some
of the other regions are thought to function 
in preventing diseases such as pellagra, tuber- 
culosis and leprosy. Although the infectious 
agents responsible for tuberculosis and leprosy 
are now rare in Britain, they have left behind 
genetic footprints in the existing population 
that probably led to some degree of protection 
in the past. Several of these are also candidate 
genes for autoimmune disease 5 . 

Despite the magnitude and wealth of infor- 
mation that this study 1 provides, other ques- 
tions about the genetic basis of common 
disease remain. The answers will become 
increasingly important as we enter an era of 
personalized medicine, in which therapy is 
tailored to an individual's genetic constitution. 
It will become crucial to discover which genes 
predispose individuals to these diseases; how 
genes interact with each other to increase the 
risk of a particular disease; and what propor- 
tion of disease is due to rare variants that would 
be hard to detect with current approaches. 

We will also want to know whether different 
patients can be stratified into subpopulations 
on the basis of genetic risk factors, and what 
role the environment has in triggering disease. 
The Genes, Environment and Health Initiative 
(GEI) of the US National Institutes of Health 
already aims to develop tools to assess environ- 
mental contribution and to answer some of the 
other questions. Ultimately, comprehensive 
answers that would allow the translation of 
genetic susceptibility into scientifically sound 
medical practice will require much larger 
patient populations, well-annotated clinical 
databases and sophisticated environmental 
assessment. One wonders what Mr Woodhouse 
would have to say to that. 
SPECTROSCOPY

The magic of solenoids 

Arthur S. Edison and Joanna R. Long 

A technique known as magic-angle spinning has helped make nuclear 
magnetic resonance spectroscopy as sensitive for solids as it is for 
solutions. Inductive thinking leads to even better signal detection. 



The great strength of nuclear magnetic reso- 
nance (NMR) spectroscopy is that it can deter- 
mine, non-invasively and at atomic resolution, 
the chemistry, structure, dynamics and over- 
all architecture of samples in solid, liquid or 
even gaseous forms. The liquid version of the 
technique, solution NMR, is used routinely to 
identify small molecules, study protein struc- 
tures and dynamics, and probe intermolecular 
interactions. Solid-state NMR teases out the 
structure and properties of materials, surfaces 
and biological solids such as human tissue. But 
compared with many other analytical tech- 
niques, NMR has extremely poor sensitivity. 
A great deal of research has sought to improve 
this situation: on page 694 of this issue 1 , Sakel- 
lariou et al. describe a potential leap forward 
for solid-state NMR. 
When atomic nuclei with non-zero spin 



are placed in an external magnetic field, they 
become polarized, precessing rather as a gyro-
scope does in Earth's gravitational field. When
electromagnetic radiation of a frequency 
(energy) that corresponds exactly to that of 
the energy gap between two states of differ- 
ent polarization is applied to the sample, the 
nuclei resonate, jumping between those states. 
The accompanying gyroscopic precession of 
the spins induces a current in a conducting coil 
placed around the sample. This basic principle 
is both NMR's blessing and its bane as a spectro- 
scopic technique: the small energies make the 
approach non-destructive, but they also make it 
difficult to distinguish the characteristic polari- 
zation (or signal) from thermal noise. 

The signal-to-noise ratio in NMR measure- 
ments can be improved by either one of two 
general routes. The first of these is enhancing 




Figure 1 | Inductive logic. a, In the traditional 'magic-angle spinning' approach to solid-state NMR, a
spectrum of better resolution is achieved by rapidly rotating the sample, at an angle of 54.7° relative to
the main magnetic field, within a static coil assembly. b, Sakellariou and colleagues' alternative approach 1
uses the inductive coupling of a smaller coil rotating with the sample to the larger static coil to produce a
similar effect. The result is a higher sensitivity and the capability to investigate smaller samples.
Guilt beyond a reasonable doubt

David Altshuler & Mark Daly

Genome-wide association studies, exemplified by the Wellcome Trust Case Control Consortium and follow-up
studies, have identified dozens of common variants robustly associated with common diseases, providing new clues
about genetic architecture in humans. Finding all such loci, and fully defining genotype-phenotype correlation, will
be a key to translating initial clues into pathophysiological understanding and clinical prediction.



Genetic screens are used to explore biological
mechanisms in vivo, unbiased by prior assump-
tions about the DNA alterations responsible
for phenotypic variation. In model systems,
genome-wide, phenotype-driven screens typi-
cally identify many genes of unknown func-
tion, ultimately leading to a broad and deep
understanding of mechanism.

In humans, success with phenotype-
driven, genome-wide screening for inher-
ited disease mutations has been limited to
mendelian traits. Human phenotypic varia-
tion is largely polygenic rather than mono-
genic, however, and thus the vast majority
of heritable factors for common human
diseases remain unknown. Genome-wide
association studies (GWASs) have been pro-
posed as a new approach to 'forward genet-
ics' in humans, but until recently they were
untested for gene discovery.

The Wellcome Trust Case Control 
Consortium (WTCCC) now reports in 
Nature the largest GWAS thus far 1 , scan- 
ning 17,000 individuals for seven diseases, 
with two follow-up studies reported in this 
issue, Todd et al. on type 1 diabetes (page 
857) and Parkes et al. on Crohn's disease 
(page 830), and another on type 2 diabetes 
published elsewhere 2-4 . Together with other
publications, statistically compelling asso- 
ciations have been identified this year by 
GWASs across a variety of diseases, including 
Crohn's disease, obesity, type 1 and type 2
diabetes, coronary heart disease and pros- 
tate and breast cancer (see Supplementary 
Note for additional references). In multiple 
diseases, five to ten independent genomic 
regions have been identified and confirmed. 
After years as 'Keystone Cops', complex trait 
geneticists can now find culprits not previ- 
ously suspected and establish guilt beyond a 
reasonable doubt. 

The current crop of successful studies 
shares five key features. First, they all use 
high-density SNP genotyping arrays (based 
on the Human Genome Project, the SNP 
Consortium and the HapMap Project) and 
analytical methods built on the synthesis of 
population genetics, statistical genetics and 
epidemiology. Second, the clinical investiga- 
tors had the foresight to collect large patient 
samples that included detailed phenotype 
information, DNA samples and informed 
consent for genetic research. Third, they 
have paid careful attention in their design 
and analysis to minimizing bias (coming 
from, for example, population substructure, 
genotyping errors or variability in DNA 
quality and laboratory processing). Fourth, 
they have applied statistical thresholds 
appropriate to genome-wide searches. With 
~10 million common SNPs to be tested
genome-wide, and few true associations for 
which power is adequate, the prior prob- 
ability of a true association is low — and the 

cussion of power in the WTCCC, see pages 
815-816 in this issue). Finally, they have 
validated putative 'positives' in indepen-
dent samples (preferably using independent
genotyping technologies). Here, 'replication' 



refers to association of the same allele to the 
same trait under the same genetic model 5 . 
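
To make the fourth feature concrete, the short Python sketch below shows why a conventional significance level is unusable when hundreds of thousands of SNPs are tested at once. It uses a Bonferroni correction, which is one common and conservative way to set a genome-wide threshold and is an assumption here, not necessarily the criterion applied in the studies discussed. The function names are hypothetical.

```python
def expected_false_positives(p_threshold, n_tests):
    """Expected number of tests passing the threshold when every null is true."""
    return p_threshold * n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that holds the family-wise error rate at alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    n_snps = 500_000  # scale of the genotyping arrays described in the text
    print(expected_false_positives(0.05, n_snps))
    print(bonferroni_threshold(0.05, n_snps))
```

At a nominal threshold of 0.05, about 25,000 of 500,000 null SNPs would be expected to pass by chance alone, whereas the Bonferroni-corrected per-test threshold is 0.05 / 500,000 = 1e-7.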

What has been learned? 

The most important outcome of these studies 
is the discovery of new biological associations 
in genes or regions previously unrecognized 
to have a role in each disease. In some cases, 
links have been newly established between 
diseases and well-studied pathways (such as 
age-related macular degeneration and the 
complement pathway, Crohn's disease and 
autophagy). In many cases, however, asso- 
ciated regions contain genes of unknown 
function or do not contain annotated genes. 
Typical of genetic screens in model systems 
and mendelian genetics, an unbiased genetic 
approach highlights genes not previously 
identified. 

Second, new mechanistic connections have 
been uncovered between diseases. Examples 
include SNPs in IL23R with Crohn's disease 6
and psoriasis 7 , PTPN2 with Crohn's disease
and type 1 diabetes 1 , PTPN22 and IL2RA
with type 1 diabetes and rheumatoid arthri-
tis 1 , 8q24 with prostate cancer and breast
cancer 8 (see also Stacey et al. (page 865) and
Hunter et al. (page 870), in this issue) and
nearby SNPs in a noncoding region of 9p
near CDKN2B and CDKN2A with type 2 dia-
betes 4,9,10 and coronary heart disease 1,11,12 .

Third, the studies have found a substantial 
fraction of associations outside of transcrip-
tion units. This is unsurprising, as coding
sequences make up less than half of the evo-
lutionarily conserved DNA in the human
genome. Investigation of functional noncod- 
ing associations will be critical to unraveling 
molecular and cellular roles of noncoding 
functional DNA in humans. 
Fourth, the results indicate that individual
SNPs have very modest effects in the popula-
tion: associated SNPs rarely show odds ratios
of >2.0 (CFH in age-related macular degen-
eration), and more typically, odds ratios are
<1.5. Undiscovered common variants are
likely to have similar or smaller effects (or
are in low linkage disequilibrium with SNPs
on arrays).

Fifth, strong evidence is lacking for
epistasis among associated SNPs, despite
joint analysis in large cohorts. Similarly,
little evidence has been obtained for strong

homogeneous disease subtypes, or quanti-
tative 'endo-phenotypes' (such as glycemic
and obesity traits in type 2 diabetes). Sixth,
despite substantial progress, the vast major-
ity of heritability remains unexplained. To
some extent, the magnitude of the associa-
tions discovered is currently underestimated,
because the full spectrum of causal variation
at each locus has yet to be defined by deep
sequencing.

A less obvious but still important impli- 
cation is that many more such loci must 
remain to be found. Even for the confirmed 
associations identified, statistical power 
found them (Table 1). Even in the large
WTCCC study (which included 2,000 cases
and 3,000 controls) 1 , the power to obtain a
genome-wide P < 10^-8 was <1% for many
of the confirmed associations discovered by 
comparison across studies and by replication 
studies. This explains the tendency of differ- 
ent GWASs to find partially overlapping sets 
of associations and makes it implausible that 

have been identified.

Where to from here? 

These papers provide proof-of-concept that 
GWASs can identify previously unknown 
causal loci. The next steps are to obtain a full 
picture of genotype-phenotype correlation at 



these loci and to find remaining loci. A more
complete picture will be critical to under-
standing the disease mechanisms underlying
the associations and to assess SNPs for clin: 
will be needed to discover
all causal mutations and fully define geno- 
type-phenotype correlation. In many cases, 
multiple independent common variants 13
and rare variants 14 will be found at the same
locus. Sequencing of exons in each associ- 
ated region may identify coding mutations of 
stronger effect, which may be easier to study 
in vitro and in individual subjects. In addi- 
tion, identification of 'smoking gun' causal 
coding mutations may help prove which gene 
at each locus is responsible for the associa-
tion and may, in aggregate, increase the over-
all predictive value of genotype.

A testable hypothesis suggested by the 
power calculations in Table 1 is that a more 
extensive set of loci that influence each dis- 
ease may be found by GWASs of greater 
power (or by combining existing GWASs). 
Common sense dictates that a complete set 
of susceptibility loci will provide greater 
biological insight than an incomplete set. 
Moreover, the biological insight provided by 
any locus is not necessarily related to the size 
of the effect of common variants used to dis- 
cover it, nor is it predictive of the combined 
effect of all rare and common variants at 
that locus. Thus, the discovery of additional 
causal loci should be pursued, followed by 
exhaustive sequencing to fully define geno- 
type-phenotype correlation. 

Some loci may be missed by well-powered
GWASs because none of the causal variants 
are in linkage disequilibrium with SNPs on 
the genotyping arrays. Some of these may 
be found by genome-wide measurement of
copy number variation. Thus, these GWASs 
are the first in a series of genome-wide, 
phenotype-driven approaches in humans, 



which, when integrated, will provide a more 
complete picture of human phenotype varia- 
tion and inborn susceptibility to disease. 

Ultimately, the value of this endeavor must 
be measured in the resulting clinical and bio- 
logical advances. Predictive testing will have 
value in cases in which effective preventa- 

changes in risk improve clinical decision- 
making. Achieving a clinical benefit will be 
challenged by the modest magnitude of SNP 
effects and by the likelihood that genetic tests 
will be made available (and aggressively pro-
moted) before or instead of mounting clini-
cal trials to evaluate the value of genetically
enabled decision-making. 

New tools and frameworks will be required 
to translate genetic insights into knowledge 
of disease pathogenesis and new therapeu- 
tics: there is little precedent for functional 
analysis based on genes discovered by poly- 
genic inheritance, noncoding DNA changes 
and quantitative alteration of gene function. 
This quest is worth mounting, however, as it 
is in pursuit of culprits whose guilt in human 
disease has been established beyond a rea- 
sonable doubt. 
Conjuring SNPs to detect associations

Andrew G Clark & Jian Li

Human genome-wide association studies pose a challenge in identifying significant disease associations from nearly
half a million statistical tests. A new report describes an especially promising approach, recently applied to the
Wellcome Trust Case Control Consortium data sets, that uses the correlated structure of genomic variation to impute
genotypes at missing sites and to test association with both observed and imputed SNPs.
Genetic mapping has always relied on statis-
tical inference, but this enterprise has never
been so utterly dependent on rigorous ana-
lytical methods as it is with genome-wide
association studies (GWASs). For each of the
nearly 500,000 SNPs in the human genome
scored by widely used genotyping platforms
for GWASs, it is possible to perform a sim-
ple statistical test of association with dis-
ease state. Even if the null hypothesis of no
association were true for all SNPs, we would
of these tests to provide nomi-
on the order of 10^-6. In order to
ositive calls, we need to identify
ich the P values are even lower,
rease the power to appropriately
reject the null hypothesis (that is, to correctly
infer that a SNP is truly associated with dis-
ease) by elevating the sample size or restrict-
ing attention to intermediate-frequency SNPs
and by being judicious in our choice of test.
In this issue, Marchini et al. 1 (page 906) show
that thoughtful application of population
genetic principles and use of HapMap data 
can provide an additional source of power 
for association tests. They have successfully 
applied these methods to the Wellcome Trust 
Case Control Consortium (WTCCC) data 2 
and have identified a collection of new genes 
associated with seven complex medical dis- 
orders (see pages 813-815 of this issue for 
discussion of the WTCCC studies). 

Imputation to boost power 

The more genetic data that we have for each 
individual, the greater the chance of finding 
variants that influence disease risk directly. 
statistically inferred or 'imputed' from the
observed genetic data. To see how imputation 
can give a boost in the power of tests of asso- 
ciation, consider the situation where a SNP 
that has a direct effect on disease risk is in the 
HapMap set of SNPs but is not on the 500K 
genotyping platform used in a given GWAS 
(Fig. 1a). In this case, if only the observed
marker SNP were used, the association test 
would be weakened by any observed depar- 
ture from perfect linkage disequilibrium 
between the observed SNP and the unob- 
served risk-enhancing SNP. This contrasts 
with the hypothetical case (Fig. 1b) in which
the risk-enhancing SNP is observed directly. 
If no other genetic variation in this genomic 
region influences risk, then the test based on 
this SNP alone will be the most powerful. 
One can see that such a direct test provides 
a greater chance to detect a significant asso- 
ciation. Because we often do not observe the 
risk-enhancing SNP directly, imputation can 
be used to close some of the gap between 
these two extremes. High linkage disequi- 
librium in the human genome means that 
we can impute the unobserved genotype of 
many of the missing SNPs with surprisingly 
high accuracy (>98% in many cases). This 
accuracy will be reduced in regions of the 
genome with unusually high recombination 
rates (for example, SNPs within hotspots). 
The example in Figure 1c is for an impu-
tation accuracy of 99%, and it is clear that
the probability of detecting the association
is much greater than in Figure 1a, where we
did not apply imputation. Marchini et al. 1
and Scott et al. 3 use multiple flanking SNPs
to impute missing SNP genotypes, and they 
find that the P values for tests of association 

the imputed SNPs than with the observed 
SNP data only. 
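
The loss of power from testing a correlated marker instead of the risk SNP itself can be sketched numerically. The Python example below relies on a standard approximation that is an assumption here and is not stated in the article: the non-centrality parameter of a 1-degree-of-freedom association test at a marker is roughly r² times its value at the causal SNP, where r² measures linkage disequilibrium between the two. The function names, the non-centrality value of 30 and the r² values are hypothetical, and an imputed genotype is treated simply as a marker with a higher effective r².

```python
from statistics import NormalDist


def power_1df(ncp, alpha):
    """Power of a two-sided 1-df test with non-centrality parameter ncp."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1.0 - alpha / 2.0)
    shift = ncp ** 0.5  # mean of the test's z statistic under the alternative
    upper_tail = 1.0 - norm.cdf(z_crit - shift)
    lower_tail = norm.cdf(-z_crit - shift)
    return upper_tail + lower_tail


def power_via_marker(ncp_causal, r2, alpha):
    """Approximate power when testing a marker in LD (r2) with the causal SNP."""
    return power_1df(r2 * ncp_causal, alpha)


if __name__ == "__main__":
    alpha = 5e-8  # an illustrative genome-wide threshold
    for r2 in (1.0, 0.99, 0.8, 0.5):
        print(r2, round(power_via_marker(30.0, r2, alpha), 3))
```

Under these assumptions, power falls steeply as r² drops from 1.0 to 0.5, which is the gap between the direct test and the marker-only test that imputation is meant to narrow.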



This may seem like sleight of hand, 
because there seems to be a gain in power 
without any additional information, as the 
missing SNPs are imputed from the observed 
marker SNPs. One might think that tests 
based only on haplotypes of the observed 
SNPs 4-6 would do just as well, because they, 
after all, are what allows prediction of the 
missing SNPs. But the method does incor- 
porate haplotype information of observed 
SNPs along with the linkage disequilibrium 
structure of the full HapMap sample to 
perform the imputation. By leveraging the 
observed marker SNPs and by predicting 
missing data from the pattern of linkage 
disequilibrium in the HapMap data, we get 
the best of both worlds. 

Testing association 

In a GWAS, the meaning of a P value becomes 
challenged in the context of so many simul- 
taneous tests. One solution to this problem 
is to calculate the false discovery rate 7,8 ; 
however, this approach was developed for 
testing a single hypothesis, as opposed to 
simultaneously testing a battery of SNPs 
associated with a disease. Association test- 
ing can be done with standard frequentist 
methods like logistic regression, where the 
model may specify either allelic or genotypic 
effects. Likelihood methods can be used to 
deal with the uncertainty in the imputations 
of missing genotype data. Bayesian methods 
also allow inference of probability of asso- 
ciation conditional on observed genotype 
data and can accommodate imputed geno- 
types easily. Marchini et al. 1 make use of one 
useful measure of the relative likelihood of 
association, the Bayes factor, a term closely 
c 1 i ' H mod t iii defined in this 
case as the probability of the observed data, 
given that the association is real, divided by 
the probability of the observed data under 
The road to genome-wide
association studies 

Leonid Kruglyak 

Abstract | The recent crop of results from genome-wide association studies 
might seem like a sudden development. However, this blooming follows a long
germination period during which the necessary concepts, resources and 
techniques were developed and assembled. Here, I look back at how the 
necessary pieces fell into place, focusing on the less well-chronicled days before 
the launch of the HapMap project, and speculate about future developments. 



Genome-wide association studies 
(GWAS) use dense maps of SNPs that 
cover the human genome to look for 
allele-frequency differences between 
cases (patients with a specific disease or 
individuals with a certain trait) and con- 
trols. A significant frequency difference is 
taken to indicate that the corresponding 
region of the genome contains functional 
DNA-sequence variants that influence the 
disease or trait in question. The recent 
crop of results from GWAS (reviewed 

in REFS 1-4) might seem like a sudden

development. However, this blooming 
follows a long germination period during 
which the necessary concepts, resources 
and techniques were developed and 
assembled. Here, I look back at how the 
necessary pieces fell into place, starting 
with early ideas and continuing with 
concrete proposals and theoretical and 
empirical studies that laid the foundation 

close by contemplating the implications of 
the lessons that were learnt from the initial 
crop of GWAS for future studies of human 
genetic variation. 

Early milestones 

Genome-wide approaches to human
genetics date back to the proposal in 
1980 by Botstein and colleagues for the 
construction of a linkage map of the 
human genome, with restriction fragment
length polymorphisms (RFLPs) as
molecular markers 6,7 . The natural initial
applications were to genetically simple 
Mendelian diseases; however, as early as 
1986, even before the first linkage map was
completed, Lander and Botstein recog- 
nized that most human traits and diseases 
follow complex modes of inheritance, 
and they discussed several approaches 
for studying such complex traits 8 . One of 
the approaches they proposed was link- 
age disequilibrium (LD) mapping, which 
recognizes that a mutation that is shared 
by affected individuals through common 
descent will be surrounded by shared 
alleles at nearby loci, representing the 
haplotype of the ancestral chromosome on 
which the mutation first occurred (FIG. 1).
The first example of LD between a DNA 
polymorphism and a disease mutation 
was provided by an association between 
an allele of an RFLP in the β-globin gene
and the sickle-cell form of haemoglobin 9 . 
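
LD between two biallelic loci is commonly summarized by the coefficient D and the squared correlation r2. The following is a minimal sketch in Python of how these quantities are computed; the function name `ld_stats` and the haplotype frequencies are hypothetical and are not taken from any study cited here.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, r2) for two biallelic loci.

    p_ab: frequency of haplotypes carrying allele A at locus 1 and allele B at locus 2
    p_a, p_b: population frequencies of allele A and allele B
    """
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    d = p_ab - p_a * p_b      # deviation from independent assortment
    return d, d * d / denom   # r2 lies between 0 and 1


# Hypothetical case: A and B occur together more often than expected by chance.
d_partial, r2_partial = ld_stats(p_ab=0.40, p_a=0.50, p_b=0.50)

# Hypothetical case: A and B always occur together on the same haplotype.
d_full, r2_full = ld_stats(p_ab=0.30, p_a=0.30, p_b=0.30)

print(f"partial LD:  D = {d_partial:.2f}, r2 = {r2_partial:.2f}")
print(f"complete LD: D = {d_full:.2f}, r2 = {r2_full:.2f}")
```

An r2 near 1 means that a genotyped marker is an almost perfect proxy for the untyped variant, which is the property that LD mapping relies on.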



[Genetic] complexity is 
present on multiple levels, 
and might be fruitfully thought 
of as 'fractal'. 



Simple population-genetics arguments 
suggested that LD in the general human 
population would probably be limited to 
distances below 100 kb 10 . For this reason, 



Lander and Botstein deemed LD mapping 
to be impractical in the general population 
owing to the high marker density that 
would be required, but they proposed 
that a map of hundreds of RFLPs might 
suffice for LD mapping in recently 
founded isolated populations. 

The first complete RFLP map of the 
human genome was reported in 1987 
(REF. 11), but human mapping studies really
flourished once microsatellites replaced
RFLPs 12 (BOX 1). Genome-wide studies
used family-linkage approaches almost 
exclusively, with LD being used to refine 
the locations of genes that were mapped 
by linkage, as pioneered by Kerem and 
colleagues for the cystic fibrosis gene 13 . In 
a groundbreaking and forward-looking 
study in 1994, Houwen and colleagues
reported the first application of LD map-
ping in an initial whole-genome search for
a disease locus 14 . Following the approach
that was envisioned by Lander and
Botstein almost a decade earlier, they used
256 markers to map the gene responsible
for benign recurrent intrahepatic cholestasis
in an isolated fishing community in the
Netherlands. Their success
relied on the rarity of the disease and on 
the availability of a population isolate in 
which the affected individuals were distant 
relatives. A similar study in a Mennonite 
kindred allowed Puffenberger and col-
leagues to identify a gene for
Hirschsprung disease 15 . However, such studies, which
straddled the border between family 
linkage and LD mapping, remained the 
exception as linkage approaches domi- 
nated. Studies of many pairs of relatives 
(most commonly, affected sib pairs) were 
especially prevalent owing to the ease of 
collecting such samples versus samples 
from extended families, to their theoreti-
cal appeal for mapping complex traits 16-18
and to the availability of powerful analysis
tools. These genome scans were carried
out for many common diseases that show
complex inheritance, but they failed to
find many reproducible loci. With these 
findings, the initial belief that a few major 
genes would explain susceptibility to 
complex diseases gave way to the realiza- 
tion that the level of complexity was much 
higher and that many loci of individually
small effect were involved. Because such 
loci are difficult to identify by family link- 
age owing to limited power, the search was 
on for new approaches. 

Modern proposals 

In an influential perspective, Risch and 
Merikangas argued that association 
studies should be more powerful than 
family linkage studies for detecting high- 
frequency, small-effect polymorphisms 20 . 
Linkage studies rely on allele sharing by 
descent among affected relatives, and 
their low power to detect such polymor- 
phisms is due to two factors. First, when 
the increased risk conferred by an allele 
is small, some relatives will be affected 
because of other causes and will not carry 
the risk allele. Second, when an allele is 
common, it can enter the family through 
multiple founders, erasing clear inheritance 
patterns. These effects combine to 
decrease sharing by descent to the point 
at which an impractically large number 
of families must be studied to detect it. 
Association studies still suffer from the 
first effect (which is inherent to searches 
for small effects) but not the second, and 
therefore they have a higher sensitivity 
in detecting common variants with small 
effects. The increase in power is such 
that even testing large numbers of poly- 
morphisms, with the ensuing statistical 
costs of multiple testing, does not erase 
the advantage of association studies 20 - 21 . 
In addition to offering higher power with 
the same sample sizes, association studies 
also have the practical advantage that large 
samples of unrelated cases and controls 
can be collected much more easily than 
family-based samples. 

Risch and Merikangas issued a call for 
a catalogue of all variants in human genes, 
and set out a challenge "to the molecular 
technologists to develop the tools" for 
their identification and genotyping 20 . 
This call was echoed by Eric Lander, 
who hypothesized that common variants 
of modest effect might hold the key to 
susceptibility to common diseases 21 (this 
was subsequently codified as the common 
disease-common variant hypothesis). 
Lander also noted that the role of noncod- 
ing variation might be studied by the use 
of LD mapping with a sufficiently dense 
polymorphism map. 

These proposals were formalized the 
following year in a policy forum by Collins, 
Guyer and Chakravarti 22 . They made 
explicit the distinction between the direct 



approach of cataloguing all common func- 
tional variants and the indirect approach 
of relying on a dense map of SNPs for LD 
mapping (FIG. 2). A back-of-the-envelope
calculation put the likely size of shared 
ancestral haplotypes in the range of 
10-100 kb, leading to a proposal to identify 
at least 100,000 SNPs. To achieve this goal, 
The SNP Consortium, a public-private 
partnership, was launched in 1999. 

Charting the course 

The early proposals for genome-wide stud- 
ies were audacious, because the number of 
SNPs known at the time was small, and the 
approaches to their discovery and geno- 
typing were cumbersome. In 1998, Wang 
and colleagues performed an important 
feasibility study, discovering some 3,000 
SNPs and developing an array-based geno- 
typing approach that could assay hundreds 
of SNPs in parallel 23 . The SNP consortium 
and the HapMap project would eventually 
bridge the gap between this early survey 
and the much larger number of SNPs 
required. 

How many SNPs are needed? The number 
of SNPs that are required for LD mapping 
obviously depends on the genomic extent 
of LD because genotyped SNPs must be 
spaced sufficiently densely to be in LD 
with most of the (potentially disease- 
associated) variants that are not genotyped. 
At the time the proposals for GWAS were 
made, few empirical estimates of the extent 
of LD were available, and these varied 
wildly from observations of LD over hun- 
dreds of kb to the breakdown of LD at very 
short distances. This range of observations 
translated into an uncertainty of up to 
three orders of magnitude in the required 
number of SNPs — from thousands to mil- 
lions. Even several years later, the number 
of SNPs required for GWAS was said to be 
in the range of 30,000-1,000,000, based 
on a survey of empirical studies 24 . Starting 
in 1997, I attempted to reduce this uncer-
tainty by using simple population-genetics
models to calculate the likely extent of 
LD. A highly realistic model could not be 
constructed at the time because of a lack 
of detailed information regarding both the 
demographic history of different popula- 
tions and the variation in recombination 
rate at short distances. Instead, the aim 
was to obtain a reasonable estimate. In the 
model that was designed to approximate 
the global human population, moderate 
levels of LD were confined to regions of 
approximately 6 kb, thus leading to the 



prediction that some 500,000 SNPs would 
be required for GWAS, even if relatively 
low LD levels between mapped SNPs and 
functional variants were deemed accept- 
able 25 . The predicted number of SNPs was 
considerably larger than the goals of SNP 
discovery projects at the time 26 , and led to 
an increase in the targeted number. Indeed, 
less than 2 years later, a map of 1 million 
SNPs was reported 24 . Given the simpli- 
fied nature of the model that was used to 
calculate the estimate of 500,000 SNPs, 
this number has held up remarkably well 
— most of the recent successes of GWAS 
came when approximately this number of 
SNPs could be genotyped within individual 
studies, and the current generation of com- 
mercial SNP-typing products deploys some 
500,000-1,000,000 SNPs. 
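
The link between the extent of LD and the number of SNPs can be illustrated with simple arithmetic. The sketch below is in Python and assumes a genome length of roughly 3 x 10^9 bp (a round figure that is not stated in the text) and one SNP per region of LD. It is an illustration only, not necessarily the calculation used in the cited model, and the function name is hypothetical.

```python
GENOME_BP = 3_000_000_000  # assumed approximate genome length in base pairs


def snps_needed(ld_extent_bp, genome_bp=GENOME_BP):
    """Number of SNPs if each LD region of the given size is covered by one SNP."""
    return genome_bp // ld_extent_bp


for extent_kb in (100, 10, 6):
    n = snps_needed(extent_kb * 1000)
    print(f"LD extent {extent_kb:>3} kb -> about {n:,} SNPs")
```

Under these assumptions, an LD extent of 6 kb corresponds to 500,000 SNPs, and the 10-100 kb range corresponds to 30,000-300,000 SNPs.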

When the prediction is viewed from the 
vantage point of the extensive empirical 
data available today (for example, REF. 27),




Figure 1 | Linkage disequilibrium around an
ancestral mutation. The mutation is indicated 
by a red triangle. Chromosomal stretches that 
are derived from the common ancestor of all 
mutant chromosomes are shown in light blue, 
whereas new stretches introduced by recombi- 
nation are shown in dark blue. Markers that are 
physically close (that is, within the light-blue 
regions of present-day chromosomes) tend to 
remain associated with the ancestral mutation, 
even as recombination whittles down the 
region of association over time. 
Box 1 | A brief history of genetic markers



Human genetic mapping was initially based on restriction fragment length polymorphisms 
(RFLPs) — fragment length variants generated through the presence and/or absence of
restriction enzyme recognition sites 6 - 11 - 53 - 54 . RFLPs resulted from various sequence changes 
including base substitutions, insertions and deletions, and were laboriously assayed by 
Southern blots. Southern blots were superseded by PCR-based assays for microsatellite 
markers (also known as short tandem repeats or simple sequence length repeats) 10 - 51 . 
Microsatellites are di-, tri- or tetranucleotide repeat sequences that are composed of many 
tandem repeats. Many alleles are generally associated with each microsatellite within most 
populations, hence their use as markers for carrying out family-based linkage analysis 1 " 5 . 
More recently, SNPs have become the markers of choice; their lower polymorphism is offset 
by their abundance and ease of genotyping"- 56 , and their low mutation rates make them 
especially suitable for linkage disequilibrium mapping. 



it is clear that the actual average extent 
of LD is greater than in the model. This 
is especially true in non-African popula- 
tions, most likely owing to a combination 
of demographic factors and a clustering of 
recombination events at hot spots 28 (both 
of these effects were anticipated when the 
prediction was made 25 ). However, the cal- 
culation of the number of SNPs assumed 
both a relatively low acceptable level of LD 
and coverage of each region of LD with a 
single SNP. In practice, GWAS have set 
a higher standard for required LD, and 
multiple SNPs are used to 'tag' each region 
of high LD. These factors combined lead 
to the most current empirical estimates 
of approximately 500,000 SNPs for non- 
African and 1,000,000 SNPs for African 
populations to ensure adequate coverage 
of the genome in GWAS, even when high 
LD levels between mapped SNPs and 
functional variants are required 29 . 

How should the SNPs be chosen? Initially, 
the discussion focused on simply assem- 
bling a dense collection of SNPs. However, 
both theoretical considerations and early 
empirical studies suggested that the physi- 
cal extent and the local patterns of LD 
were likely to vary across the genome and 
among populations. In commenting on 
one empirical study 30 in 1999, I proposed
that LD among a dense collection of 
SNPs be measured empirically across the 
genome and in different populations in 
order to identify the most efficient SNP 
panels for association studies (FIG. 3); 
such panels would vary in their density 
by region of the genome and by popula-
tion 31 . The result of such empirical studies
would constitute an LD map of the human
genome. As SNP discovery efforts continued,
empirical data confirmed both the
need for hundreds of thousands of SNPs
and the fact that these SNPs could not be
chosen at random or by uniform spacing



across the genome 5 - 32 . Rather, a million
or more SNPs would need to be genotyped 
in substantial numbers of individuals from 
multiple populations in order to select 
sets of several hundred thousand (with 
the precise number depending on the 
population) that would efficiently capture 
untyped common variants 5 - 32 54 . These 
observations gave rise to the HapMap 
project and a parallel effort by Perlegen
Sciences, which eventually joined forces
to produce the SNP panels that are being 
used today 2 "''- 37 . These projects also drove 
the development of rapid and cost-effective 
genotyping technologies, setting the 
stage for GWAS. These recent develop- 
ments are well chronicled elsewhere (for 
example, REF. 38). 

The road ahead 

In the past two decades, GWAS have 
progressed from visionary proposals, 
made when neither the sequence of the 
human genome nor many variations in 
this sequence were known, to routine 
practice of screening 500,000-1,000,000 
SNPs in thousands of individuals. The 
recently reported phase 2 of the HapMap 29 



now includes 3 million SNPs, estimated 
to cover one-quarter to one-third of all 
human SNPs with frequencies above 5%. 
Where do we go from here? 

The recent crop of discoveries from
GWAS is a major advance in our under- 
standing of the genetic basis of common 
diseases, as well as normal human varia-
tion 39,40 . Nevertheless, the associated loci
that have been identified usually have
small individual effects on phenotype,
and even collectively tend to explain only
a small fraction of the heritable com-
ponent; for some diseases studied, no
significant loci have been identified 41 .
This failure to detect loci that explain 
the bulk of the heritable components of the 
phenotypes studied could be attributable to 
several factors. First, because the detected 
loci have small effects, the power to detect 
them is low, and more such loci remain 
to be discovered as sample sizes increase. 
Second, association studies can only 
detect the effects that are due to relatively 
common alleles. Rare alleles remain to 
be discovered — both at the loci that are 
identified by GWAS because they also have 
common alleles with phenotypic effects, 
and at other loci that do not have such 
common alleles. The former can be found 
by focused resequencing studies of the 
loci identified by GWAS; finding the latter 
might require resequencing of other genes 
in the relevant pathways, of the exons 
of all genes 42-44 or of the entire genome.
Third, we might be missing the effects of 
structural variation, of other less well- 
studied types of genome alterations 45 , and 
of interactions among variants and between 
genetic and environmental factors. 

It is only a matter of time before all 
SNPs with appreciable frequencies in the 
human population have been discovered. 
Indeed, efforts to discover and genotype
additional SNPs in larger and more 
diverse population samples are underway. 
Assuming that the genotyping technologies 
can keep up the pace, the indirect associa-
tion studies relying on LD will be replaced
by direct association studies that genotype all
relatively common SNPs (perhaps the
estimated 11 million SNPs with minor 
allele frequencies exceeding 1% in the 
population 46 ), although it will probably
still be worthwhile to exclude wholly 
redundant SNPs. Thus, LD and haplotype 
maps are merely useful but temporary 
shortcuts. An interesting finding of 
phase 2 of the HapMap is that for approxi- 
mately 1% of all SNPs (tens of thousands), 
the basic assumption of indirect associa- 
tion mapping breaks down — these SNPs 
are not in LD with any others, often owing 
to their location in hot spots of recombi- 
nation, and are thus 'untaggable' and must 
be assayed directly for phenotypic 
associations 29 . Tailored approaches that are
under development will cover structural 
variants 47 . Studies based on the resequenc- 
ing of individual genomes (rather than 
genotyping of known variants) will be 
needed to begin to comprehensively 
address the role of rare variants and 
de novo mutations, and will eventually 
replace genotyping studies altogether, 
although this is likely to take some time. It 
is worth noting that resequencing studies 
of rare variants have to rely on the rec- 
ognition of many different variants, each 
of which alters the function of the same 
gene or pathway in different individuals. 
Whereas recognition of likely functional 
variants in coding regions is straightfor- 
ward, detecting functional changes in 
noncoding DNA poses a major challenge. 
This is because regulatory sequences can 
be located far from the coding region 
and are often difficult to identify, and we 
do not have a ready connection between 
nucleotide differences and function for 
these sequences. 

Looking further ahead, we can already 
envision the day when the genome 
sequences of a significant fraction 
of the population are known, at least 
in the developed world. Assuming that
the relevant logistical and ethical issues 
can be solved, what will we learn from 
combining this unprecedented scope of 
genetic information with medical records 
and other phenotypic data? We are just 
beginning to get the first glimpses of the 
real underlying genetic complexity of phe- 
notypic variation. Complexity is present 
Figure 3 | Schematic of a genomic region to be tested for association with a phenotype. The

four reference SNPs in the mapping panel are indicated by red triangles; these are genotyped 
directly. The eight SNPs indicated by yellow triangles are captured through linkage disequilibrium 
(by proxy) with the reference SNPs denoted by arrows. The four SNPs indicated by blue triangles 
are neither genotyped nor in linkage disequilibrium with the reference SNPs; phenotypic 
association that is due to one of these would be missed. 



on multiple levels, and might be fruitfully 
thought of as 'fractal'. First, many loci are 
involved; we do not yet know how many 
but the number could be in the hundreds 
for many traits. Second, individual loci 
can often represent variation in multiple 
linked genes, as has been found in model 
organisms (for example, REF. 48). Third, 
each gene is likely to contain multiple 
functional variants, including both 'super- 
alleles' of linked alterations on one haplo- 
type and allelic series with a range of allele 
frequencies and effect sizes. Non-additive 
interactions can be present at all levels. 
GWAS detect effects at the locus level, 
and an important challenge for future 
studies is identification of the genes, the 
functional variants and the functional 
mechanisms underlying phenotypic asso- 
ciations. Currently, such studies require 
painstaking, low-throughput experiments 
in cell lines and animal models. 

It is possible that some genetic contri- 
butions to human phenotypic variation 
might be too subtle to unravel, even 
when our surveys of the genome become 
truly comprehensive and the sample sizes 
approach that of the human population. 
Aside from the question of how much of 
the population variation we will ultimately 
be able to explain, we also have to ask 
how we can piece together individual risk 
from many small genetic contributions. 
Will we ultimately be able to classify 
individuals into meaningful groups with 
regard to risk of specific common diseases 
or response to drugs, as envisioned in 
personalized medicine? Doubtless this 
will be (or already is) true in some cases, 
but it is currently unknown how general 
such classifications are. We might need 
to replace some current phenotypic and 
disease classifications with ones that bet- 
ter correspond to the underlying genetic 
causes, perhaps by developing methods 



to iteratively refine phenotypic categories 
by combining genotypic and phenotypic 
information. Careful and detailed meas- 
ures of phenotypes and environmental 
exposures will also have an important 
role. Clearly, we have a lot of work to do 
before an individual genome sequence is 
more phenotypically informative than 
it is today 49,50 . In the meantime, great care
is required in offering genome-based
information to individuals 51,52 .

Concluding remarks 

What is the best future direction for 
human genetics? There are essentially three 
avenues to pursue: much larger samples; 
better assays of genome variation that can 
capture both common alterations that 
are not in LD with SNP panels and rare 
variants; and more detailed phenotyping. 
Undoubtedly, each of these approaches 
has a role, and we do not yet have all the 
information needed to decide which will 
prove most fruitful. Therefore, it is a high 
priority to apply a full battery of approaches 
to several model diseases and phenotypes 
in order to empirically determine the range 
of outcomes, just as the Wellcome Trust 
Case Control Consortium study of seven 
diseases provided an empirical guide for 
GWAS. In my opinion, the most pressing
question is the contribution of rare variants, 
both in the genes that harbour common 
risk variants and in those that do not. This 
question is also the most difficult to address 
comprehensively with today's technologies,
but it seems imperative that we prioritize 
studies to begin to get the answers. 

Leonid Kruglyak is at the Lewis-Sigler Institute for
Integrative Genomics and the Department of Ecology
and Evolutionary Biology, Princeton University.
Genome-wide association studies:
progress and potential for drug 
discovery and development 

Stephen F. Kingsmore, Ingrid E. Lindquist, Joann Mudge, Damian D. Gessler
and William D. Beavis 

Abstract | Although genetic studies have been critically important for the identification of 
therapeutic targets in Mendelian disorders, genetic approaches aiming to identify targets 
for common, complex diseases have traditionally had much more limited success. However, 
during the past year, a novel genetic approach — genome-wide association (GWA) — has
demonstrated its potential to identify common genetic variants associated with complex 
diseases such as diabetes, inflammatory bowel disease and cancer. Here, we highlight some 
of these recent successes, and discuss the potential for GWA studies to identify novel 
therapeutic targets and genetic biomarkers that will be useful for drug discovery, patient 
selection and stratification in common diseases. 
Genetic factors are known to have an important role in
many common diseases, and the identification of genetic 
determinants for such diseases has the potential to pro- 
vide insights into disease pathogenesis, revealing novel 
therapeutic targets or strategies. Genetic factors could 
also provide useful biomarkers for diagnosis, patient 
stratification and prognostic or therapeutic categoriza- 
tion. In addition, given that inherited genetic factors are 
present at birth, knowledge of these factors could facilitate 
timely preventative or ameliorative interventions. 

During the past 25 years, genetic linkage-based stud- 
ies have proved very effective in identifying causal 
genetic factors in Mendelian (single gene) disorders; 
causal genes for more than 1,300 dominant and recessive 
Mendelian diseases have been identified 1 . Most common 
diseases and [illegible], however, do not exhibit
Mendelian inheritance, but rather feature complex,
multifactorial expression and inheritance. Although 
linkage-based methods have been broadly applied,
these studies have had little success in identifying the 
allelic determinants of common disorders 2 . In particular, 
there has been poor replication among studies, whereby 
an initial study identifies an allele (genotype) with large 
estimated genetic effects (relative risk) but subsequent 
studies fail to corroborate the results 1 - 4 . In part, this 
reflects the dependence of linkage-based studies on 
unusually informative families (with multiple affected 
and unaffected individuals), which induce a bias toward 



rare, semi-Mendelian disease subsets in subpopulations. 
Reports of successful identification of genetic variants in 
common diseases using an approach that circumvents 
this limitation — genome-wide association (GWA) studies 
— have therefore generated considerable excitement. 

Human GWA studies are based on three hypotheses: 
First, the common trait/common variant hypothesis 
proposes that the genetic architecture of complex traits 
consists of a limited number of common alleles, each 
conferring a small increase in risk to the individual 5 - 6 ; 



second, the [illegible]
located, common (ancient) variants; and, third, sup-
very frequently. Thus, approximately 80% of the human 
genome is comprised of around 10 kb regions that 
exhibit reduced recombination in human populations 
(haplotypes) 7 . Genetic variants (alleles) within haplo-
types are in linkage disequilibrium (LD). This phenom-
enon enables much of the recombination history in a
population to be ascertained by genotyping a large set 
of well-spaced, common (ancient) variants throughout 
the genome, especially if variant selection is informed by 
knowledge of haplotypes. During the last 10 years, more 
than 10 million single nucleotide polymorphisms (SNPs)
have been identified 8 . Furthermore, the International
HapMap project has genotyped approximately 4 million 



NATURE REVIEWS



© 2008 Nature Publishing Group 






Box 1 | Useful [illegible] and databases for genetic-based studies

• Genetic Association Database: An archive of human genetic association studies of complex diseases.

• SchizophreniaGene Database: An archive of genetic association studies performed on schizophrenia phenotypes.

• Online Mendelian Inheritance in Man: A catalogue of human genes and genetic disorders.

• Human Gene Mutation Database: A catalogue of published gene lesions responsible for human inherited disease.

• Human Genome Variation Database: A catalogue of normal human gene and genome variation.

• dbSNP: A catalogue of human single nucleotide polymorphisms.

• GeneSNPs: A database of polymorphisms in human genes that are thought to have a role in susceptibility to
environmental exposure.

• PharmGKB: A database of pharmacogenomics research.

• GeneCards: A database of human genes that includes genomic, proteomic and transcriptomic information, as well as
orthologies, disease relationships, SNPs, gene expression and gene function.
common SNPs (occurring with a frequency
of more than 5%) in human populations and has assem-
bled these genotypes computationally into a genome-
wide map of SNP-tagged haplotypes 7 . These resources,
together with array technologies for massively parallel
SNP genotyping and the well-established epidemiologi-
cal case-control association studies, have rendered GWA
feasible (BOX 1; FIG. 1).

Initial genetic association studies focused on candidate 
loci and exhibited a lack of replication among studies 9,10 .
There were biological explanations for inconsistent 
results: unobserved, confounding biological sources of 
heterogeneity, including inconsistent or poorly defined 
measurements of the phenotype, heterogeneous genetic 
sources for the phenotype ([illegible]), population

stratification (ethnic ancestry), population-specific 
LD, heterogeneous genetic and epigenetic backgrounds or 
heterogeneous environmental influences (phenocopies). 
In addition, there were statistical reasons for irrepro-
ducibility, including failure to control the rate of false
discoveries, model misspecification and heterogeneous
bias in estimated effects among studies 11-14 . Also, a fre-
quent source of non-replication was lack of power due
to the limited number of individuals genotyped and
phenotyped 15,16 .

In order to ameliorate poor replication, GWA experi- 
ments employ multi-tiered experimental designs with 
discovery, replication and biological validation stages 17 
(FIG. 1). Tiered designs are critical for cost-effective
detection of meaningful, hypothesis-generating, geno- 
type-phenotype associations given the large number 
of comparisons involved, prior probability estimates of 
association, sample sizes, resampling procedures and 
statistical significance thresholds. GWA studies also owe 
their statistical power to their large cohort size and high 
rate of SNP detection. Currently, a respected threshold
for uncorrected, significant associations is P < 5 × 10^-7
(REFS 18,19). Alleles with moderately less significant asso-
ciations, however, are often also reported, as they might
[illegible] that reach the aforementioned threshold in
subsequent studies. 
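
The reason for so stringent a threshold becomes clear when one counts the false positives expected if every tested SNP were null. The following minimal Python sketch assumes independent tests, which real SNPs are not because of LD, so it is only a rough guide. The function names and the panel sizes are illustrative assumptions rather than values from a specific study.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of null tests that pass a per-test threshold alpha."""
    return n_tests * alpha


def bonferroni_threshold(n_tests, family_alpha=0.05):
    """Per-test threshold that keeps the family-wise error rate at family_alpha."""
    return family_alpha / n_tests


ALPHA = 5e-7  # per-test threshold quoted in the text

for n_snps in (500_000, 1_000_000):
    fp = expected_false_positives(n_snps, ALPHA)
    bonf = bonferroni_threshold(n_snps)
    print(f"{n_snps:>9,} SNPs: expected false positives = {fp:.2f}, "
          f"Bonferroni threshold = {bonf:.1e}")
```

At a conventional threshold of 0.05, the same panels would be expected to yield tens of thousands of spurious signals, which is why replication in an independent cohort remains essential.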



Results of initial GWA studies 

The first GWA study, published in 2002, evaluated 
acute myocardial infarction (AMI) 20 . The discovery,

or nomination, phase comprised the examination of 
genotype-phenotype association signals in 65,671 coding 
domain SNPs (cSNPs) in 752 cases and controls (TABLE 1 ). 
Although subsequent studies have used up to 20 times 
this number of non-coding SNPs, gene-tagging SNPs are 
more informative, as the majority of true-positive associa- 
tions are expected to be with genes 1 . Even more informa- 
tive are screens that employ functional cSNPs, such as 
nonsynonymous SNPs (nsSNPs), that are candidate, 
causal (risk-enhancing) gene alleles'- 21 ~ a . The replication, 
or confirmatory, phase examined associations of 26 SNPs 
in 2,137 individuals and confirmed association of AMI 
with a 50 kb region containing lymphotoxin-α (LTA),
nuclear factor of kappa light polypeptide gene enhancer
in B cells (also known as RELA), nuclear factor of kappa
light polypeptide gene enhancer in B cells inhibitor-like 1
(NFKBIL1) and human leukocyte antigen (HLA)-B
associated transcript 1 (BAT1) genes. Additional replica-
tion studies have been undertaken, some of which have
confirmed an association of this region with AMI-related
phenotypes and, in particular, one nsSNP in LTA 29-35 . The
association of LTA with AMI was an unexpected finding,
suggesting a novel therapeutic target. 

A second, pioneering GWA study examined
age-related macular degeneration (AMD) 36 . The discovery
phase sought associations of 105,980 SNPs with AMD
in 96 cases and 50 control individuals. Despite the small
cohort size, SNPs in the complement factor H (CFH) gene,
including an nsSNP, showed significant association with 
AMD. Replication was not performed, but subsequent 
studies have replicated associations of CFH alleles with 
AMD 37-40 . Of all common diseases examined by GWA to
date, AMD is unique in that a single haplotype explains 
61% of the genetic variance, conferring a homozygous 
odds ratio of 7.4. To put this in perspective, this is of a 
imilar mag to the el t- ions ot 

with ^ • and ]u 4 a n e 

] es with . . 1 Tl DM). Complement 
[Figure 1 flowchart, as recovered from the extracted text]

• Matched for confounding [...] ethnicity and sex

Microarray-based SNP genotyping
• ~1 million random marker SNPs or
  ~25,000 risk-enhancing SNPs (for example, nsSNPs)

Detection of association signals
• χ² or similar test
• Uncorrected P < 10⁻⁷ or false discovery rate-like
  correction

Fine mapping of association signal (see FIG. 2)
• Directed genotyping of additional SNPs in region
• Fine mapping of LD in region of association
• Empirical derivation of haplotypes
• Examination of effect of stratification, if available

Replication of association
• Large independent cohort of cases and controls
  (n > 1,000)
• Genotyping of nominated candidate SNPs (<2[...])
• χ² or similar test; replication of initial signal

Biological validation of association
• Identification of risk-enhancing variant
• Examination of functional consequence of variant
• Determination of mechanism of risk-enhancement

Genotypes            CC    AA    CA    Total
Cases observed       59    27    98    184
Controls observed    60    89    36    185
Total                119   116   134   369



Figure 1 | Overview of the general design and workflow of a genome-wide
association (GWA) study. The discovery phase entails genotyping many case and
control DNA samples and evaluation for significant associations. The replication phase
involves fine mapping of association signals and independent confirmation in a second
cohort. Biological validation is important for translation of GWA findings into diagnostic
or therapeutic discoveries.
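
The "χ² or similar test" step of the workflow can be illustrated with the genotype counts printed alongside Figure 1 (cases 59/27/98 and controls 60/89/36 for genotypes CC/AA/CA). The following is a minimal sketch, not the authors' analysis: the function names are hypothetical, and the closed-form P value used here is exact only for 2 degrees of freedom, which is what a 2 × 3 genotype table gives.

```python
import math

# Genotype counts shown with Figure 1 (column order: CC, AA, CA).
cases = [59, 27, 98]
controls = [60, 89, 36]


def chi_square_2xk(row_a, row_b):
    """Pearson chi-square statistic and degrees of freedom for a 2 x k table."""
    n_a, n_b = sum(row_a), sum(row_b)
    n = n_a + n_b
    stat = 0.0
    for a, b in zip(row_a, row_b):
        col_total = a + b
        expected_a = col_total * n_a / n  # expected count under independence
        expected_b = col_total * n_b / n
        stat += (a - expected_a) ** 2 / expected_a
        stat += (b - expected_b) ** 2 / expected_b
    return stat, len(row_a) - 1


def p_value_df2(stat):
    """Upper-tail P value; exp(-x/2) is exact for 2 degrees of freedom only."""
    return math.exp(-stat / 2.0)


chi2, dof = chi_square_2xk(cases, controls)
p = p_value_df2(chi2)
print(f"chi-square = {chi2:.2f}, df = {dof}, P = {p:.2e}")
```

With these counts the statistic is roughly 62 on 2 degrees of freedom, so this illustrative table would pass the uncorrected P < 10⁻⁷ criterion listed in the workflow.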



pathway dysregulation was a novel, unexpected
association with AMD. Subsequent studies have shown an
association of AMD with two additional members of the
alternative complement pathway (factor B (CFB) and
C2) 41,42. These findings, together with biological
validation studies, have led to the initial development of new
AMD therapies, based upon complement inhibition.

In the past year, technical challenges associated with 
GWA were largely overcome, genotyping costs were 
decreased and a significant number of studies have used 
SNP genotyping arrays in larger population groups to 
produce replicated associations between individual SNP 
alleles and common diseases. 

Inflammatory bowel disease. Five large GWA studies
have examined Crohn's disease and ulcerative colitis,
two histologically distinct types of inflammatory bowel
disease (IBD) (TABLE 1). Four of the studies used
microarrays featuring between 300,000 and 400,000 SNPs 18,43–45,
whereas the fifth study genotyped approximately 16,000
nsSNPs 24. Two follow-up studies sought to replicate
the most significant signals from the Wellcome Trust
Case Control Consortium (WTCCC) study 18, one in a
European population and another in a Japanese
population 46,47. The European study replicated significant
signals of the WTCCC study, but some of the alleles failed
to reach significance in the Japanese study and others
were not detected. The failure to replicate signals in
different studies might reflect true differences between
populations, differences in phenotype ascertainment or a
lack of power.

Considering the six studies of European populations,
there was significant replication of specific allele
associations with Crohn's disease (TABLES 1,2). Three associations
were concordant in four out of five studies (representing the
genes caspase recruitment domain 15 protein (CARD15,
also known as NOD2), interleukin 23 receptor (IL23R)
and ATG16 autophagy related 16-like 1 (ATG16L1)).
Of note, CARD15 had previously been identified as a
susceptibility gene by linkage-based approaches 48,49. One
gene, prostaglandin E receptor 4 (PTGER4), showed
association in two out of five studies. In addition, several
disease-associated intergenic segments have been
replicated. IBD susceptibility genes that have been identified to
date appear to coalesce into biological networks involving
innate immunity, autophagy and phagocytosis 50. In
addition, alleles of two genes associated with Crohn's disease
(IL23R and [...]) have shown association with other
autoimmune disorders 21,51, suggesting the existence of
autoimmune susceptibility 'supergenes'. There is great
interest in alleles that exhibit pleiotropic associations,
as they potentially represent blockbuster targets that
cross over therapeutic categories (TABLE 2).

In common with most GWA studies to date, estimated
genetic effects (relative risks) of IBD-associated loci are
small 18,46. However, as many of these variants were
common, the population attributable risk (an estimate of
the percentage of cases of disease that would be avoided
if the allele(s) were absent) was substantial. Of several
studies that looked for epistatic interactions between IBD
association signals, two found suggestive evidence of
epistasis involving two different pairs of genes 24,52.
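
To see why a common variant with a small relative risk can still carry a substantial population attributable risk, a short calculation helps. The sketch below uses Levin's standard formula, PAR = p(RR - 1) / (1 + p(RR - 1)), where p is the frequency of carriers in the population. The formula is a textbook choice assumed here, not one stated in the text, and the frequencies and relative risk are hypothetical illustration values rather than figures from any study discussed above.

```python
def population_attributable_risk(carrier_freq, relative_risk):
    """Levin's formula: fraction of cases avoided if the exposure were absent."""
    excess = carrier_freq * (relative_risk - 1.0)
    return excess / (1.0 + excess)


# Hypothetical values: the same modest relative risk at two carrier frequencies.
common_variant = population_attributable_risk(carrier_freq=0.30, relative_risk=1.2)
rare_variant = population_attributable_risk(carrier_freq=0.01, relative_risk=1.2)

print(f"common variant PAR: {common_variant:.1%}")
print(f"rare variant PAR:   {rare_variant:.1%}")
```

At an identical relative risk, the variant carried by 30% of the population accounts for a far larger share of cases than the one carried by 1%, which is the dependence on allele frequency that the paragraph describes.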

Diabetes mellitus. A good example of the capabilities
and limitations of GWA studies is type 2 diabetes
mellitus 18,53–57 (T2DM; TABLE 1). Two studies examined
association both with SNPs and haplotypes in the
discovery phase 54,57. Haplotype-based analysis can be
more powerful than marker-by-marker analysis in
association studies 22,58–61. For example, haplotypes can
correlate a specific phenotype with a specific gene in
a small population sample even when individual SNPs
cannot 62. Case-control and family-based association studies
were employed in several studies of T2DM.

The replication phases of these studies were
impressive; two of them included over 9,000 replication
individuals 53,54. One study sought to replicate signals identified
by the WTCCC study 18 by genotyping the most
significant SNPs; 9 of 77 candidate SNPs reached a P < 5 × 10⁻⁷
significance level 63. The eight genes represented by these
SNPs were replicated in at least one other independent
study (TABLES 1-3). The concordance of T2DM-associated
genes between GWA studies is striking: of 10 novel [...]

Reassuringly, some of the genes identified by GWA in
studies of T2DM have previously been associated with
the disease in other types of genetic studies. For example,
Table 1 | Discovery and replication designs of recent GWA studies

                              Discovery phase                                                  Replication phase
Disease                       Number of       Number of                      Population        Number of    SNPs        Population            Refs
                              individuals     SNPs                                             individuals  validated/
                              examined                                                         examined     tested

AMD                           146             105,980                        Caucasian         96           2/50        Same
Asthma                        2,642           307,328                        UK/German         2,320        0/9         German                92
Atrial fibrillation           5,026           316,515                        Icelandic         17,810       2/18        Icelandic/[...]       94
Bipolar disorder              1,024 (pooled)  555,235                        Western European  1,648        1/37                              91
Breast cancer                 754             227,876                        European          45,426       7/30        Same                  85
                              2,287           528,173                        European          3,848        1/8         Same                  19
                              13,163          311,524                        Icelandic         7,968        2/9         Various
Celiac disease                2,200           310,605                        UK                2,480        5/27        Dutch/Irish           97
Colorectal cancer             1,890           547,647                        Caucasian         23,121       2/18        Same
                              2,593           99,632                         Canadian          23,325       2/1,143                           112
Crohn's disease               1,923           304,413                        European          2,150        4/37        [illegible]           45
                              1,103           16,360 nsSNPs                  German            2,670        3/72        [illegible]
                              1,475           302,451                        Belgian           2,236        7/10        [illegible]           44
IBD                           1,095 + 834     308,332                        European          2,885                    Same                  43
LOAD                          1,086           502,627                        Caucasian         ND           ND          ND                    90
Lung cancer                   673                                            Italian           621                      Caucasian/Norwegian   98
Memory                        341 (pooled)    502,627                        Swiss             680                      Several               87
AMI                                           65,671 cSNPs                   Japanese          2,137        4/26
Nicotine dependence           548 (pooled)    2,427,357                      European          1,929        0/31,960    European              93
Obesity                       4,862           490,032                        British/Irish     29,596       1/1         Same                  71
Prolonged QT interval         3,966           88,500                         German            4,451        1/7         European              96
Prostate cancer               4,517           316,515 & 243,957 haplotypes   Icelandic         3,655        2/2         Several               53
                              12,791          310,520                        Icelandic         5,050        2/5         European              78
                              2,339           550,000                        European          6,266        2/2         Several               79
RLS                           2,045           236,758                        European          2,336        9/13                              101
                              15,970          306,937                        Icelandic         2,206        1/70        Icelandic/US
T2DM                          2,335           315,635                        Finnish           2,473        10/80                             55
                              4,900           393,453                        European          9,103        10/77                             63
                              7,805           313,179 & 339,846 haplotypes   Icelandic/Danish  3,382        2/47                              57
                              1,316           392,935                        French            5,511        8/57                              56
T2DM and triglyceride levels  2,931           386,731 & 284,968 haplotypes   Finnish/Swedish   10,850       3/107       Several               54
T1DM                          3,388           6,500 nsSNPs                   European          12,229       1/1         UK
Table 2 | Loci and variants associated with multiple diseases in GWA studies

[Table body largely lost in extraction. Recoverable entries: locus PTPN22,
variant (rs) 6679677, T1DM; further disease entries PC, T2DM, obesity,
triglyceride level, Alzheimer's disease and PC; reference entries
79,81,113 and 111-113.]



shown linkage to T2DM in the Icelandic population, 
and significant association in a candidate gene
association study. Heterozygous and homozygous carriers of
TCF7L2 risk alleles had relative risks of 1.45 and 2.41, 
respectively. TCF7L2 is a transcription factor that 
regulates the pro-glucagon gene in entero-endocrine 
cells 63 . TCF7L2 alleles have also shown associations with 
endophenotypes such as a lower likelihood of response 
to the oral hypoglycaemic drug sulphonylurea 66 and 
increased risk of progression to T2DM among persons 
with impaired glucose tolerance 67 . 

In common with IBD, T2DM associations exhibited 
small estimated-effect sizes. Some of the candidate 
genes from GWA studies were consistent with biologi- 
cal processes that have previously been implicated in 
the pathogenesis of T2DM, such as pancreatic islet beta- 
cell function and insulin biosynthesis. However, these
studies also suggested new components of these processes [...].

Validation of T2DM candidate genes as therapeutic 
targets will require additional studies to identify causal 
susceptibility alleles and to determine their precise 
effect on cell biology. 

Three studies performed initial modelling of how
loci combine to affect susceptibility to T2DM 56,63,69. One
study found evidence of epistatic interactions between
two genes. Otherwise, T2DM appeared to fit a polygenic
threshold model with additive/multiplicative effects of
individual loci. However, until the causal alleles that
underpin these association signals have been found, it
is not possible to make categorical statements about the
allelic architecture of T2DM.
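
The difference between additive and multiplicative effects can be made concrete with the TCF7L2 relative risks quoted earlier in this section (1.45 for heterozygous and 2.41 for homozygous carriers). The sketch below is illustrative only: the two model functions are simplified textbook forms on the relative-risk scale, assumed here rather than taken from the modelling in the cited studies.

```python
def multiplicative_rr(per_allele_rr, copies):
    """Each risk allele multiplies the relative risk by the same factor."""
    return per_allele_rr ** copies


def additive_rr(per_allele_rr, copies):
    """Each risk allele adds the same excess relative risk."""
    return 1.0 + copies * (per_allele_rr - 1.0)


# Relative risks reported in the text for TCF7L2 risk-allele carriers.
het_rr, hom_rr = 1.45, 2.41

expected_mult = multiplicative_rr(het_rr, copies=2)
expected_add = additive_rr(het_rr, copies=2)

print(f"observed homozygote RR:          {hom_rr:.2f}")
print(f"expected, multiplicative model:  {expected_mult:.2f}")
print(f"expected, additive model:        {expected_add:.2f}")
```

Under these simplified forms the observed homozygote risk lies nearer the multiplicative expectation than the additive one, although a two-point comparison of this kind cannot by itself discriminate between models.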

Frequencies of T2DM-associated alleles showed
considerable variation between ethnic and racial groups.
Despite these differences, however, T2DM-associated
risk alleles were conserved between independent
populations, implying an ancient origin of these
polymorphisms.

Expansion of an initial association of an allele with a
categorical trait (such as the presence of a disease) with
quantitative component phenotypes (endophenotypes)
is an approach pioneered with apolipoprotein E (APOE)
alleles in [...]. It appears to be highly
instructive in elucidating the mechanism of action
of alleles in disease pathogenesis. One T2DM GWA
extended its analysis to a quantitative endophenotype:
T2DM-related obesity (measured by body-mass index
(BMI); TABLE 1). Alleles associated with T2DM in the
fat-mass and obesity-associated gene (FTO) 18 also
showed an association with BMI [...]. Association
of FTO with obesity has since been confirmed.

Two GWA studies examined T1DM. One examined
6,500 nsSNPs 28 and the other evaluated 392,575 SNPs 18.
Four T1DM susceptibility loci had previously been
identified by linkage-based methods (class II MHC alleles,
CTLA4, PTPN22 and insulin). GWA studies replicated
the association with PTPN22 and identified several
novel loci, including C12orf30, KIAA0350 (also known
as CLEC16A) and [...] (each replicated in two studies).
Twenty-one T1DM candidate genes that have previously
shown linkage or association are currently undergoing
replication studies.

T1DM, like rheumatoid arthritis and IBD, is an
autoimmune disorder. Medical practitioners have long
noted familial aggregation of autoimmune diseases. One
study showed association of both rheumatoid arthritis
and T1DM with specific polymorphisms (IL2RA-rs2104286
and PTPN22-rs6679677; TABLE 2) 18. T1DM,
rheumatoid arthritis and IBD also show association with
MHC alleles. These findings suggest common
underlying aetiological pathways (and therapeutic targets) for
several common autoimmune disorders 77.



Cancer. GWA studies of cancer based on
inherited SNPs are useful for the identification of
germline risk alleles, but not somatic mutations. Three GWA
studies sought inherited association signals in prostate
cancer. FIGURE 2 shows details of the discovery phase
of one of these studies. An association signal at
chromosome 8q24 that had previously been identified by linkage
analysis 80 was replicated in two GWA studies 55,79. In
addition, these studies identified a second 8q24 association,
approximately 300 kb upstream from the first. As yet, the
functional basis of these associations is unclear. Although
individual 8q24 alleles showed modest estimated genetic
effects, the cumulative effect of several loci fit a
multiplicative model that conferred a population-attributable
risk (PAR), that is, an expected reduction in
prostate-cancer incidence if the risk alleles did not exist in the
population, of up to 68% 81. As noted above, PAR values
are strongly affected by allele frequency and represent
only an approximate measure of the contribution of
those alleles to disease incidence.

One study of prostate cancer 78 identified a TCF2 (also
known as HNF1B) susceptibility allele. Intriguingly, this
allele appeared to diminish the risk of T2DM (TABLE 3),



© 2008 Nature Publishing Group 



Table 3 | Loci and variants exhibiting association with type 2 diabetes mellitus in GWA studies

T2DM phenotype            Locus                       Variants (rs)                           Refs

Susceptibility            TCF2                        4430796, 7501939                        78
                          TCF7L2                      4506565, 7903146, 7901695               18,54–57,63
                          PPARG                       1801282                                 55,63
                          KCNJ11                      5219, 5212                              54,55,63
                          SLC30A8                     13266634, rs118253964                   55,56,63
                          HHEX                        1111875                                 55,63
                          IGF2BP2                     4402960                                 54,55,63
                          CDKAL1                      9456871, 7754840, 10946398, 7756992     18,55,57,63
                          CDKN2A/B                    10811661, 564398                        54,55,63
                          Chromosome 11, intergenic   9300039                                 55
                          FTO                         9939609, 7193144, 8050136               18,55,63
Low density lipoprotein   APOE                        4420638                                 54
High density lipoprotein  CETP                        1800775                                 54
High-triglyceride level   LPL                         17482753                                54
                          GCKR                        780094                                  54

Above variants are associated at P < 5 × 10⁻⁷. T2DM, type 2 diabetes mellitus.



possibly representing antagonistic pleiotropy. This is sup- 
ported by epidemiological evidence which suggests that 
diabetic men have a slightly lower prostate cancer risk 
than non-diabetic men 82 . 

Another allele exhibiting association in two diseases
is rs6983267 at chromosome 8q24, which has shown
replicated associations with prostate and colorectal
cancer.

Three GWA studies sought inherited associations 
with breast cancer 19,85 - 8 . Although each study identified 
significant novel loci, two genes and one allele were each 
supported in two studies. 

Complex traits 

In addition to common diseases, GWA studies are
applicable to complex traits. One study undertook GWA with
numerous quantitative and categorical memory-associated
endophenotypes 87. Despite a small discovery cohort (341
individuals), associations with the KIBRA (also known as
WWC1) gene have been replicated. A notable
innovation in this study was that associations were sought with
multi-scale and multi-modality endophenotypes; that
is, performance in seven memory-associated tests and
functional magnetic resonance image-based measures
of the hippocampus during three memory-associated
tests. This study provides evidence that progress can be
made in the elucidation of the genetic determinants of
subjective, qualitative neurologic traits by using objective,
quantitative, surrogate endophenotypes.

As well as identifying novel associations, GWA studies
have confirmed susceptibility genes that were
previously established by linkage analysis in large pedigrees.
For example, a GWA study of late-onset Alzheimer's
disease (LOAD) identified the well-established APOE
susceptibility allele 90. This association was also replicated
in a study that genotyped 17,343 putative functional
cSNPs 23.



A limitation of GWA studies is the
cost of genotyping, but one study provided evidence that
sample pooling strategies might help to overcome this
issue. In a GWA study of bipolar disorder, investigators
created 39 pools, containing DNA from 2,672
individuals 91. These pools were used for both discovery and
replication experiments. Pools were individually genotyped
for 555,235 SNPs and normalized allele frequencies were
inferred from intensity data. Replicates were assayed for
each pool. Thirty-seven SNPs showing allele frequency
differences in both cohorts were individually genotyped
and one SNP retained a significant association. The
aforementioned WTCCC study also studied bipolar
disorder, identifying an association at 16p12 (REF. 18). One
locus, for glutamate receptor, metabotropic 7 (GRM7),
showed association in both studies.

The rate of publication of GWA studies continues
to increase. Recent studies have investigated asthma 92,
nicotine dependence 93, coronary artery disease 19,26, atrial
fibrillation 94, prolonged QT interval and sudden cardiac
death 95,96, coeliac disease 97, lung cancer 98, psoriasis 21 and
liver cirrhosis 25, among others (TABLE 1).

Initial conclusions on the utility of GWA 

The utility of GWA studies for the identification of 
novel genomic associations with complex diseases 
has unambiguously been established over the past 
year. In general, GWA studies have employed large 
case-control cohorts featuring [...]
cases, categorical trait definition [...]
million commonly polymorphic SNPs. To date, with
the exception of CFH in AMD, the estimated genetic
effects of replicated associations have been uniformly
and surprisingly small.

[...] haplotype intervals
identified to date [...]
gene. In large measure, this reflects the use of several,
[Figure 2 panel labels: association results for SNPs from the Hap300 chip
in the 125-135 Mb interval; the 127.5-128.9 Mb interval (magnified).]

Figure 2 | Schematic view of genetic linkage, GWA results, fine mapping and
linkage disequilibrium structure in a region of chromosome 8q24.21 that
demonstrates an association of rs1447295 and rs16901979 with prostate cancer.

a | Previously reported genetic linkage scan results for chromosome 8, centiMorgans (cM)
100-170 (that is, 8q) from 871 Icelandic individuals with prostate cancer in 323 extended
families. A quantitative trait locus (QTL) for prostate cancer susceptibility with log of the
odds (lod) score of ~2 is shown. The interval between the two dashed horizontal lines
corresponds to a previously reported admixture signal that is associated with prostate
cancer. b | Genome-wide association (GWA) results for 1,660 single nucleotide
polymorphisms (SNPs) mapping to chromosome 8 Mb 125-135 in 1,453 Icelandic
individuals with prostate cancer and 3,064 controls. Association testing P values smaller
than 0.1, corrected for relatedness and population stratification, are shown for single
SNPs (blue circles), two SNPs (red circles) and linkage disequilibrium (LD)-block
haplotypes (green circles). Four SNPs (including rs1447295) and three haplotype blocks
(including Hap C, defined by 14 SNPs) show significant association signals (P < 1.58 × 10⁻⁷).
Two-sided P values were derived using Fisher's exact test and
were unadjusted for multiple comparisons. Association testing of haplotype block P
values were carried out using the expectation-maximization (EM) algorithm directly for
the observed data. c | Association results from b, shown in greater detail, for a 1.4 Mb
interval on 8q24.21. Filled black circles represent 225 SNPs and the orange boxes
represent recombination hotspots (calculated from the HapMap using the likelihood
ratio test). d | LD between SNPs, measured by the square of the correlation coefficient
calculated for each pairwise comparison of SNPs (r²) from the Centre d'Etude du
Polymorphisme Humain from Utah (CEU) HapMap population for the 225 SNPs in c; the
blue boxes at the bottom indicate the location of the FAM84B, AF268618 and MYC genes
and the AW183883 expressed sequence tag. Figure modified, with permission, from
Nature Genetics REF. 55 © 2007 Macmillan Publishers Ltd.
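
The significance threshold quoted in the Figure 2 caption (P < 1.58 × 10⁻⁷) is numerically what a Bonferroni correction of a 0.05 family-wise error rate gives for the 316,515 SNPs listed in Table 1 for the Icelandic prostate cancer discovery set. That reading is an inference from the numbers, not something the caption states, and the sketch below only checks the arithmetic; the function name is hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


# 316,515 SNPs: value listed in Table 1 for the Icelandic prostate cancer study.
# Treating it as the number of tests behind the caption's threshold is an assumption.
n_snps = 316_515
threshold = bonferroni_threshold(alpha=0.05, n_tests=n_snps)
print(f"per-SNP threshold: {threshold:.3g}")

# The same calculation for a hypothetical array of one million SNPs.
threshold_1m = bonferroni_threshold(alpha=0.05, n_tests=1_000_000)
print(f"per-SNP threshold for 1,000,000 SNPs: {threshold_1m:.3g}")
```

The second call shows how the per-SNP threshold tightens as array content grows, which is one reason the larger chips discussed later in the article need larger cohorts to detect the same effect.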



outbred populations in confirmatory, fine-mapping
studies. Even when the association is within a single gene,
the predisposing [...] might affect an adjacent gene, as
in adult lactose intolerance. Although some association
intervals have been found to contain a single,
unequivocally functional gene variant, the causality of alleles has
been established in only a minority of cases. Causal alleles
identified to date do not yet show much difference in
genetic mechanism from those identified in Mendelian
disorders; this could reflect ascertainment bias.

Many genes identified by GWA were not candidate
genes previously, highlighting the hypothesis-informing
value of genetic studies. Already, there are examples of
potentially [...] that had not
previously been considered in a disease or trait. As yet, the
confluence of associated genes into biological networks 
and pathways is at an early stage. In part, this reflects scant 
or incorrect annotation of many genes. There appears 
to be a significant conservation of associations of com- 
mon alleles between human populations. Thus, to date, 
it appears that GWA studies are fulfilling expectations 
with regard to the elucidation of molecular mechanisms 
underpinning poorly understood, common diseases. 

In the few informative studies reported to date, endo- 
phenotypes have been highly instructive in dissecting the 
network or pathway that is perturbed by an individual 
allele, which affects a complex trait. It is particularly 
exciting to see the application of multi-mode endophe- 
notypes, such as combinations of psychological testing, 
brain imaging and gene expression 87 . This is clearly an 
area of potential opportunity. 

The cost of enrolling the very large cohorts that are 
needed to discover and validate alleles with small effect 
sizes has hitherto precluded the collection and integration 
of rich, accurate clinical metadata. It is likely that future 
studies will use a much greater stratification of traits than 
the phenotypically crude studies reported so far. Recent 
GWA studies of breast cancer provide a good example 
of the added genetic complexity that can be revealed by 
trait stratification"'* 5 ' 84 . In addition, following replication 
of associations with categorical traits, it is anticipated that 
targeted genotypic examination of many endophenotypes 
will be highly instructive in the dissection of the role of
[...] pathogenesis.

GWA studies show significant potential to redefine 
disease classification. In some cases, GWA studies are 
identifying molecular factors that enable patient
stratification and might prove useful in personalized medicine.
Cancers provide the clearest examples of this to date. In
other cases, exemplified by IBD, GWA studies are pointing
to common molecular underpinnings in diseases that
were believed to be distinct. In [...],
replicated associations have provided concrete evidence
that the phenotype [...] bona fide neurological
disorder [...] great anticipation
that GWA studies [...] objective, molecular
revision of disease categorization.

Many questions remain regarding the genetic
architecture of common diseases. These include the extent
of locus and allelic heterogeneity, fit with an additive
[...] model
(the relative contributions of rare and common, and
high and low penetrance alleles) and various types of
variation, from genome rearrangements to SNPs. GWA
studies are not designed to evaluate these questions.
Once loci have been identified, methods such
as deep resequencing can nominate candidate
susceptibility alleles and provide data for the evaluation of genetic
architecture 102. Meaningful, individual risk
determinations will require the identification of causal alleles, the
development of multiplexed molecular diagnostics and
significant modelling.

Future developments and implications 

The trends observed in recent GWA studies are
anticipated to continue. Chips with 900,000 and 1,000,000
SNPs were recently launched and genotyping
accuracies have improved. Cohort sizes are steadily
increasing and biobanks of unparalleled size and
phenotype definition are being established. Combinations of
genotype- and haplotype-based associations are
becoming more prevalent. Experimental designs and statistical
methods are also becoming more uniform, enabling
more meaningful meta-analysis. In particular, the
emergence of adaptive designs and the use of Bayesian
inferential methods will produce a probabilistic synthesis
from combined analyses 83. Importantly, this will provide
an intuitive framework for combining information from
multiple studies, resulting in more effective detection
and replication of weak associations.

As noted above, phenotypes studied to date have been 
crude. The use of endophenotypes is expected to increase 
significantly. In particular, biomarker phenotypes are 
anticipated to become widely used. These will probably 
include gene expression, proteomic, metabolomic and 
imaging biomarkers. As determinants of complex traits 
are identified, genetic stratification will become possi- 
ble, potentially reducing the genetic complexity of traits 
and enabling the identification of additional association 
signals. An example of this was the recent use of perio- 
dic limb movements and serum ferritin levels in GWA 
studies of restless leg syndrome 100 . An area of substantial 
future interest for the pharmaceutical industry will be 
pharmacogenetic GWA studies to identify markers for 
patient stratification in clinical trials. Comprehensive 
pharmacogenetic information will, in turn, facilitate 
the practice of personalized medicine. Pharmacogenetic 
GWA studies and early adoption of personalized therapy 
are likely to be used in the selection of expensive or 
chronic medications in life threatening conditions or 
where the therapeutic index is narrow or adverse event 
concerns are high, such as cancer chemotherapy. 



Despite the current excitement, GWA studies have only
been able to account for a small proportion of the expected
genetic variance in complex traits. This is not
surprising given current limitations. First, current GWA studies
are designed to identify common risk alleles that are
predicted to be involved in common disorders under the
common disease/common variant hypothesis. Increasing
evidence suggests that some complex disorders and
traits, such as schizophrenia, hypercholesterolaemia
and body mass, are genetically heterogeneous. The
genetic basis of such diseases is more likely to conform
to the common trait/rare variant hypothesis, which
proposes that many rare variants exist, with substantial allelic
heterogeneity at causal loci. The GWA approach is
unable to detect susceptibility loci that harbour
numerous, individually rare (recent) polymorphisms. Instead,
a resequencing approach will be needed to identify rare
alleles. Encouragingly, massively parallel sequencing
methods provide a potential solution 102, suggesting
disease-specific rare alleles and recent mutations that
provide supplementary genotyping array content. Second, a
proportion of the genome cannot effectively be examined 
on the basis of tag SNP genotypes. Approximately 20% of 
the genome is comprised of recombination hotspots that 
are not amenable to LD-based approaches 7 . Alternatively, 
at recombination coldspots, haplotype blocks might be 
too large for unambiguous identification of causal loci. 
The extent of the effect of genomic copy number variation 
(CNV) on association signals is not yet clear, although 
recent genotyping arrays do provide CNV information. 
Insufficient numbers of cases will be available for GWA 
studies of many orphan diseases, uncommon disease 
complications or adverse events. For some common dis- 
eases, these considerations could obfuscate a substantial 
proportion of the genetic variance. Supplementation 
of genotyping array content reflective of CNV regions 
should, however, circumvent some of these limitations. 
Use of adaptive statistical methods and resampling strat- 
egies might also circumvent the need for thousands of 
affected individuals in studies of orphan diseases 83 . 

GWA successes are creating substantial need for
downstream genetics, biochemistry and cell biology efforts to
confirm the biological relevance of genotype-phenotype
associations and to elucidate the underlying mechanisms
of disease. This is especially true of association signals in
gene deserts or alleles without apparent functional conse- 
quence. Translation of the fruits of GWA studies to clinical 
practice will require the derivation of predictive models of 
the genetic architecture of complex traits that evaluate with 
much greater precision the contributions of factors such 
as epistasis, genocopies, phenocopies and penetrance. 



1357-1369 (2001). 



Lander, E. & Kruglyak, L. Genetic dissection of
complex traits: guidelines for interpreting and
reporting linkage results. Nature Genet. 11, 241–247
(1995).

Chakravarti, A. Population genetics — making sense
out of sequence. Nature Genet. 21, 56–60 (1999).
Reich, D. E. & Lander, E. S. On the allelic spectrum
of human disease. Trends Genet. 17, 502–510
(2001).

The International HapMap Consortium. A haplotype
map of the human genome. Nature 437, 1299–1320
(2005).



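Sherry, S. T. et al. dbSNP: the NCBI database of genetic
variation. Nucleic Acids Res. 29, 308–311 (2001).
Hirschhorn, J. N., Lohmueller, K., Byrne, E. &
Hirschhorn, K. A comprehensive review of genetic
association studies. Genet. Med. 4, 45–61 (2002).

Ioannidis, J. P., Ntzani, E. E., Trikalinos, T. A. &
Contopoulos-Ioannidis, D. G. Replication validity of
genetic association studies. Nature Genet. 29,
306–309 (2001).

Cardon, L. R. & Bell, J. I. Association study designs
for complex diseases. Nature Rev. Genet. 2, 91–99
(2001).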




).:•• 361 i 



604 



(2003). 

Redden, D. T. & Allison, D. B. Nonreplication in
genetic association studies of obesity and diabetes
research. J. Nutr. 133, 3323–3326 (2003).

Aiiiuripaa. A1 ; a. Acaata-n K Rcp'icniion in ge-ncta 
i! ' 68 

646- 657 (2004|, 
Lohmueller, K. E., Pearce, C. L., Pike, M., Lander, E. S.
& Hirschhorn, J. N. Meta-analysis of genetic
association studies supports a contribution of
common variants to susceptibility to common disease.
Nature Genet. 33, 177–182 (2003).

Risch, N. J. Searching for genetic determinants in the
new millennium. Nature 405, 847–856 (2000).



4,000 cases of sew 



78(2007). 
:WA study u 

eta!. A geno 



ma '' . r 32, 650-654 



IL23R as psoriasis-risk genes. Am. J. Hum. Genet. 80,
273-290 (2007), 

Clark. A C & I i, j Conjuring SNPs K> detect 
associations. Nature Genet. 39, 81 5-81 6 (2007). 

j. Crupe, A. etal. Evidence for novel susceptibility 
genes for late-onset Alzheimer's disease from a 
,» n mir wide association study ' ' tun a a 
variants. Hum. Mol. Genet. 1 6, 865-873 (2007). 

■. Hampe, J. etal. A genome-wide association scan of 
nonsynonymous SNPs identifies a susceptibility 
variant for Crohn disease in ATG16L1. Nature Genet.
39, 207-211 (2007), 

Huang, H. etal. Identification of two gene variants 
associated with risk of advanced hbro i in • u 
with chronic hepatitis C. Gastroenterology 1 30. 
1679-1687 (2006). 

. Luke, M. M. etal. A polymorphism in the protease-like 
domain of anolip • nm i in- d with sever e 

coronary artci v n' , \ ; ■ •< ■• 

Biol. 27, 2030-2036 (2007). 
Shiftman D t I , i r ui gene variants 

associated with myocardial infarction. Am J Hum. 
Genet. 77, 596-605 (2005). 

. Smyth, D. J. etal. A genomi i i t 



locus in the interferon- induced he 
Nature Genet. 38, 617-619 (2006). 
. Clarke, R. et al. Lymphotoxin-a gene and risk of 
myocardial infarction in 6,928 cases and 2,71 2 
controls in the ISIS case-control study. PLoS Genet. ~. 
el 07 (2006), 
Kimura. A. etal. Lack of a 
LCALS2 polymorphisms and myocardial i ' - tr-i 

69 P 265-269 (2007). P " S 

SAT; xrhhii ' LJA gene-rap lenn, -.rah pi.aaa.o-a 

Laxtoii R . FA-aico. E: . ls>-.jkou T & V S A-vaua 
.;! Ah- Ar-phataMn :■: gc-nc- ''la 26 Asa aeA is.': pia- n 
with severity of coronary atheros 
Immun. 6, 539-541 (2005). 



< n i r ith u 

different German populati 

1 + A 7 1 



polymorphisms of the lymphotoxrn a g 
myocardial infarction in Japanese, J. /v 
477-483 (2004). 



Discovery of a single variant that explains a larg 
component of the genetic variance in a common 
human disease. 

37. Hageman, C. S. etal. A common I 
compleri i t • . .i I - 
predisposes individuals to age-related macular 
degeneration. Proc. Natl Acad. Sci. USA 102, 
7227-7232 (2005). 

38. Magnusson. K. P.etal. CFH Y402H confers similar 
risk of soft nrnsen and bolh Ici'H of advanced A All 
PLoS Med. 3,e5 (2006). 

39. Souied. E. H. et al. Y402H complement factor H 

Vis. 1 1 , 1 1 35- 1 1 40 (2005). P P " at ° ' ,C 



>. 149-153(2005). 
n in factor B (Sf) and 
nt 2 (C?) genes is a i.i 



wide association study 



P - '. . 

81-885 (2007). 



131-1336 (2007). 

s detects multiple suso 
1341-1345(2007). 

for type 2 diabetes. N 



39 770-775(2007). 

lull* \ LI t II , I ' 

aesr the importance of 
rare variants for complex diseases. J. Med. Genet. 42, 
221-227 (2005). 

Morris, R. W. & Kaplan, N. L On the advantage of 
hapiotvpep a , 1 1 I 

susmptibl'ityA.lleies Gen," H„ 23 -A v - a 3 a 

(2002). 

. Zhang, K„ Calabrese. P., Nordborg, M. & Sun, F. 
Haplotype block structure and its applications to 
association studies, power and study designs Am J 
Hum. Genet. 71, 1386-1394 (2002). 



Fhang. K r< Ap:,., I A.,-c-,-,sig A'a- ::;«p . ! Pa- S'sl-A 
inthemappin t n n i i ri 
. i 

(2005). 



I of pro; 



gene Scwnce 314, 1461-1463(2006). 
44. Libioulle, C. et al. Novel Crohn disease locus identified 
by genome-wide association maps to a gene desert on 
5p13.1 and modulates expression of PTGER4. PLoS
Genet. 3, e58 (2007). 

45. Rioux, J. D. et al. Genome-wide association study
identifies new susceptibility loci for Crohn disease and 
implicates autophagy in disease pathogenesis. Nature 
Genet. 39, 596-604 (2007). 

46. Parkes, M. et al. Sequence variants in the autophagy
gene IRGM and multiple other replicating loci 
contribute to Crohn's disease susceptibility. Nature 
Genet. 39, 830-832 (2007). 

47. Yamazaki, K. et al. Association analysis of genetic
variants in IL23R, ATG16L1 and 5p13.1 loci with
Crohn's disease in Japanese patients. J. Hum. Genet.
52, 575-583 (2007).
48. Hugot, J. P. et al. Association of NOD2 leucine-rich
repeat variants with susceptibility to Crohn's disease.
Nature 411, 599-603 (2001).
49. Ogura, Y. et al. A frameshift mutation in NOD2
associated with susceptibility to Crohn's disease.
Nature 411, 603-606 (2001).
50. Xavier, R. J. & Podolsky, D. K. Unravelling the 

pathogenesis of inflamma 

448, 427-434 (2007). 



nome-wide analyses of 
'f. 39. 857-864 (2007). 
vide association study for 



i 280 4 4 | i A 

1 i E • ,i i, - 

therapeutic rasnea-.e to -.ctiaayaiieaa a CoDARTs 
study. Diabetes 56, 21 78-21 82 (2007). 
A Florez, J C <t . i r ► ■ i u u, and genotype- 
uhenoApe correlations erf bra suitor ij 'ire a receptor 
and the islet ATP-sensitive potassium channel gene 
region. Diabetes 53. 1 360-1 368 (2004). 
Hi,,.,-, f ii ,i h i , „ t of TCF7L2 gene 
evolution. 

i-t 39 218-225 (2007). 
'11,-- i i a - 1 ' i matron from 
common type 2 diabetes risk polymorphisms improves 
disease piedi.n •/ ' 3 , I AW OS) 
Stephens. J C ac ;,: Hap ! vera: vaaaaen and linkage 

i 1 hum hi genes. Science 293, 

489-493 (2001). 

Pi ayiing, T. M. et al. A common variant in the FTO 
gene is associated with body mass index and 

pa-arrpr.a.a to ; Ir Aheap ana inn t caipg, :v aaa a 

316, 889-894 (2007), 
Ii- ir f K n in FTO contributes to 

childhood obesity and severe adult obesity. Nature 
Genet. 39, 724-726 (2007). 
. Rich, S.S.et al. The Type 1 Diabetes Genetics 
Consortium. Ann. NY Acad. Sci. 1079, 1-8 (2006). 

ip 1 March ri E a pi 
I I r ir i i t I It LA 

complex. World J. Gastroenterol. 12, 3628- 3635 
(2006). 

Orozco, G., Rueda, B. & Martin, J. Genetic basis of
rheumatoid arthritis. Biomed. Pharmacother. 60,
656-662 (2006).

r ' 'A • • i " ri.s loleofHLAclasslgene 

a betes. Rev. Diabet. Stud. 

2,97-109(2005). 

' II 1 '• ' V i I t i i ;ha ilia . 
i ' ii in i i ii i tium (MADGC) 

he PTPN22 620W allele associates with 



1-571 (2005). 



977-983 (2007). 

> ' 'I * i .natron study of 

prostate cam , in „ 
' ' 39 A) (2007) 

I I 

| ii' it i 33 
I i n ' ii, i ns within 8q24 

ii i I - ' rs 1 t i 

Genet. 39, 638-644 (2007). 

Rodriguez. C. FPU DiaPa'a-, and re.k -A prostate 



Epidemiol. 161. 147 -152 (2005). 

KaighP J C Rc-guArora :.-a,l!-a--ohr-,ii-r aiicA-ayrr.g 

r I 3 

(2005), 

11 ' A " 1 r ri c , 

polymorphisms associated with complex vs. 
Mendelian disease: evolutionary evidence for 
differences in mo'ecu -a effects Acer N'ltl.Vad Sr , 
USA 101, 15398-15403 (2004) 



2q35and 16ql2cc 
receptor-positive brc 
865-869 (2007), 






'1', . 

I i i r A. ;c dependent 

a'-a.aa'ean vt KHiRA ."enetr vananors and Alzheimer's 
disease risk, Neurobiol. Aging 1 6 Aug 2007 
|doi:l< 3007 07.003] 

Sabaix- K K.aiati H Oouo J Wagner. M 6 
lessen. F ' I i ' are assy.i.jted n 

eiasa-Ja a <a'arv in l.eaalsv mne-'v V-momo/ Or/ae-a 

• ■ •.. ooo/ 'i i- ! o:o ; 

Coon K D eta: 0 m;:h d--n- rv e.haie faaiome 
association -..'..-J, '«.aa, ilia: Of'Of a n-.a raaa» 

faH 1 

(2007). 

Baum, A. E. et al. A genon i m i i ' 

n f til' J I I r 



childhood asthma. Nature 448, 470-473 (2007). 

Inch density is-neno aids eeaaxaaaoii rood,, far nicotine 

r 10 t 

I II 11 1 lit nil ii nl i I I 1 

448 

353-357 (2007). 

Aamoudse, A. J. el at Common VOSMP variants are 
associated with a prolonged QTc interval in the 
Rotterdam Study. Circulation 1 1 6, 1 0- 1 0 3007) 



g, D. E. e 



jr NOSIAP modulates .. adiaa 



re Genet. 38, 644-65 1 O'aiai 
van Heel, D. A. etal. A genome-wide association soja , 
forceliacdiseaseidentitie.it f nit! in 
I • 

(2007), 



Cenet 12. 2333-2340(2003). 
0. Stefansson, H. etal. Agei e'i 'i ! " O > n 
limb movements in sleep. N. Engl. J Med. 357, 
530 647(2007). 
. : Winkeimanta J err--; Genome- once aasoci ition srudy 
of restless legs syndrom 
in three genomic regions. Atafure Cenet 39, 
1000-1006(2007). 

Altshuler, D. & Daly, M. Guilt beyond a reasonable
doubt. Nature Genet. 39, 813-815 (2007).

Hunter D J & Kraft 0 Dunking from the Are hose — 



identifies iaajaatjaaiilia sns.setibia' 
I > 1 

(2007). 



N. Engl. J. Med. 357, 436-439 (2007). 
.. Ahituv, N. etal. Medical sequencing at th< 
of human body mass. Am. J Hun; f a r e; 
779-791 (2007). 



Acknowledgements 

and U01AI066.569, and by Na 
grant 0524,775, The authors th 



o v, i wti illy sup 
i grants N01A000.064 
al Science Fc 










Frayling, TM, Nature Reviews Genet 8:657-662 (2007) 



PROGRESS 



Genome-wide association studies 
provide new insights into 
type 2 diabetes aetiology 

Timothy M. Frayling

Abstract | Human geneticists are currently in the middle of a race. Thanks to a new 
technology in the form of 'genome-wide chips', investigators can potentially find 
many novel disease genes in one large experiment. Type 2 diabetes has been hot 
out of the blocks with six recent publications that together provide convincing 
evidence for six new gene regions involved in the condition. Together with candidate
approaches, these studies have identified 11 confirmed genomic regions that alter
the risk of type 2 diabetes in the European population. One of these regions, the 
fat mass and obesity associated gene (FTO), represents by far the best example of 
an association between common variation and fat mass in the general population. 



Genome-wide association studies (GWAS) 
promised to greatly enhance our under- 
standing of the genetic basis of common 
and complex diseases. Companies such as 
Affymetrix and Illumina have developed
chips that can capture information from 
more than two-thirds of the common vari- 
ation in the human genome. Approximately 
300-500,000 SNPs can be analysed using 
these chips. Importantly, this can now be 
done for several thousands of DNA samples 
at costs that are within the scope of large 
project grants. 

This technology recently facilitated rapid 
progress in type 2 diabetes genetic research.
This is all the more remarkable because type
2 diabetes does not have a strong genetic
component compared with some other com-
mon traits, and was previously described
as 'a geneticist's nightmare' 1,2. Nevertheless,
early results have been excellent, yielding six 
new replicating gene regions. 

Here I discuss the insights into type 2 
diabetes genetics that have been provided by 
these new findings. I consider where diabe- 
tes genetic studies might go from here, and 
present a perspective that may be applicable 
to other common traits. I also briefly discuss 
the wider implications that surround the 
identification of a common gene that predis- 
poses to type 2 diabetes by altering fat mass. 



The geneticist's nightmare 

Type 2 diabetes is one of the leading health 
problems throughout the developed world 
and is becoming increasingly important in 
the developing world. It has risen in preva- 
lence dramatically in the past two generations 
as we have come to lead increasingly sed- 
entary lifestyles and as food has become 
more plentiful. Obesity, defined as a body 
mass index (BMI) of greater than 30 kg m⁻²,
increases the risk of type 2 diabetes. Such a 




strong environmental component to a dis- 
ease should perhaps have deterred geneticists 
from studying the disorder. However, there 
are many obese people who do not suffer 
from diabetes and many non-obese people 
who do, showing that obesity is not the only 
factor involved in the aetiology of type 2 
diabetes (FIG. 1). 

In the past 10 years, geneticists have 
devoted a large amount of effort to finding 
type 2 diabetes genes. These efforts have 
included many candidate-gene studies, exten- 
sive efforts to fine map linkage signals 3 , and 
an international linkage consortium that was 
perhaps the best example of a multi-centre 
collaboration in common-disease genetics. 
Of these efforts, only the candidate-gene 
studies produced unequivocal evidence for 
common variants involved in type 2 diabetes. 
These are the E23K variant in the potassium
inwardly-rectifying channel, subfamily J,
member 11 (KCNJ11) gene, the P12A
variant in the peroxisome proliferator-
activated receptor-γ (PPARG) gene 7, and
common variation in the transcription
factor 2, hepatic (TCF2) and the Wolfram
syndrome 1 (WFS1) genes. All of these
genes encode proteins that have strong
biological links to diabetes. Rare, severe
mutations in all four cause monogenic forms
of diabetes 11-14, and two are targets of
anti-diabetic therapies: KCNJ11 encodes a
component of a potassium channel with a


Legend: non-diabetic patients (mean age 59); diabetic patients (mean age 64).

Figure 1 | Distribution of body mass index in type 2 diabetic patients compared with non-
diabetic individuals of a similar age. Individuals come from the Diabetes Audit and Research in
Tayside (DARTs) study in Scotland 20.
Figure 2 | Effect sizes of the 11 common variants confirmed to be involved in type 2 diabetes
risk. The x axis gives the year that published evidence reached the levels of statistical confidence that
are now accepted as necessary for genetic association studies. CDKAL1, CDK5 regulatory subunit-
associated protein 1-like 1; CDKN2, cyclin-dependent kinase inhibitor 2A; FTO, fat mass and
obesity-associated; HHEX, haematopoietically expressed homeobox; IDE, insulin-degrading enzyme;
IGF2BP2, insulin-like growth factor 2 mRNA-binding protein 2; KCNJ11, potassium inwardly-rectifying
channel, subfamily J, member 11; PPARG, peroxisome proliferator-activated receptor-γ gene;
SLC30A8, solute carrier family 30 (zinc transporter), member 8; TCF2, transcription factor 2, hepatic;
TCF7L2, transcription factor 7-like 2 (T-cell specific, HMG-box); WFS1, Wolfram syndrome 1.



key role in β-cell physiology that is a target for
the sulphonylurea class of drugs, and PPARG
encodes a transcription factor involved in
adipocyte differentiation that is a target for
the thiazolidinedione class of drugs.

In 2006, deCODE genetics identified
common variation in the TCF7L2 gene as a 
type 2 diabetes risk gene region 15 . This result 
was encouraging for two reasons. First, this 
study analysed more than 200 markers across 
a region of linkage, but the variants that were 
found to alter risk did not explain the linkage 
signal, suggesting that a non-candidate-gene 
or region-based association effort (such as 
a GWAS) could work. Second, TCF7L2 was a 
completely unexpected gene — this showed 
that a genome-wide approach could uncover 
previously unexpected disease pathways. 

In early 2007, GWAS provided by far the 
biggest increment to date in our knowledge of 
the genetics of this common health problem. 



Six new gene regions identified 

Together, the six recent GWAS papers 
provide convincing evidence for six new 
gene regions involved in type 2 diabetes 16-21;
a seventh publication describes how one 
of these variants alters BMI and represents 



by far the best example of an association
between common genetic variation and
obesity 22. There are now 11 gene regions
in which common variation alters type 2
diabetes risk with the levels of statistical
confidence that are required by genetic
association studies (FIGS 2,3). This progress
is all the more remarkable in view of the 
weak genetic component to type 2 diabetes 
risk, as compared with many other common 
diseases that are currently being studied 
using GWAS. The sibling relative risk is 3-4 at
the most for type 2 diabetes, in contrast with
5-10 for rheumatoid arthritis, 15 for type 1
diabetes, 7-10 for bipolar disorder, 17-35
for Crohn's disease, 2-[…] for […]
infarction and 2.5-3.5 for hypertension 21.

The six papers 16-21 describe five separate
type 2 diabetes GWAS — extensive rep- 
lication data from a study in the United 
Kingdom resulted in a second paper arising 
from the one initial genome-wide scan. 
These five studies had several features in 
common. First, they all used relatively large 
sample sizes. In combination, DNA samples 
from more than 18,000 individuals were 
analysed on the genome-wide chips, with 
the number of cases ranging from 686 to 



1,924 and the number of controls from 669 to
5,275 in each study. Second, all studies used 
DNA samples that were collected from a well 
defined country or area of Northern Europe, 
and all participants were of Northern 
European ancestry. This reduced the likely 
impact of population admixture, one of
the few possible confounding factors that can 
occur in genetic association studies. Third, 
all five studies used extensive follow-up 
case-control studies. The large number of 
tests a GWAS results in (up to ~400,000)
means that p values of ~5×10⁻⁷ are needed
to provide a study-wide p value of 0.05. The 
investigators suspected that there would 
not be many signals that reached this level 
of significance, so each study assembled 
between 2,473 and 10,850 additional cases 
and controls in which to assess their top 
'hits'. This brought the total number of 
cases and controls used in the five studies to 
approximately 55,000. 
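
The link between the number of tests and the per-test threshold can be illustrated with a Bonferroni-style correction, which divides the desired study-wide error rate by the number of tests. This is only a minimal sketch: it assumes that all ~400,000 SNP tests are independent, and the function name is illustrative rather than taken from any of the studies. The result has the same order of magnitude as the ~5×10⁻⁷ working threshold quoted above, but it is not a derivation of that figure.

```python
def bonferroni_threshold(study_wide_alpha: float, n_tests: int) -> float:
    """Per-test p value that keeps the study-wide error rate at study_wide_alpha.

    Assumes independent tests (a simplification: neighbouring SNPs are correlated).
    """
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return study_wide_alpha / n_tests


# Hypothetical inputs taken from the text: study-wide p = 0.05, ~400,000 tests.
threshold = bonferroni_threshold(0.05, 400_000)
print(f"per-test threshold: {threshold:.2e}")
```

Because SNPs on a chip are correlated, the effective number of independent tests is smaller than the number of SNPs, so a simple division of this kind is conservative.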

Although there were some differences 
in phenotype definition between studies — 
some used type 2 diabetic patients of younger 
age at diagnosis or controls and cases with 
similar BMI — the overall approaches were 
remarkably similar. It was therefore reas- 
suring to observe consistent results across 
the five publications. Five of the six gene 
regions were reported in at least three papers, 
and meta-analyses of the individual studies 
show that the statistical confidence of the 
findings ranges from 1×10⁻¹² to 1×10⁻[…]
(Supplementary information S1 (box)).
The fat mass and obesity-associated (FTO)
gene region emerged from only one study
but, because the association is with BMI, 
it could not have been detected in studies 
that used cases and controls of similar BMI. 
Where more than one publication reported 
the same locus but a different SNP, the SNPs 
were always strongly correlated (on the 
basis of high r² values). This showed that
the studies had found the same risk allele, 
even if they had 'tagged' it with a different 
variant, thereby providing true replication. A 
summary of the findings is shown in TABLE 1.
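
The r² measure mentioned above can be sketched as follows. The haplotype and allele frequencies are hypothetical values chosen for illustration, not data from any of the five studies; the formula is the standard squared correlation between two biallelic loci, r² = D² / (p_A(1 - p_A) p_B(1 - p_B)) with D = p_AB - p_A p_B.

```python
def r_squared(p_ab: float, p_a: float, p_b: float) -> float:
    """Squared correlation (r^2) between alleles A and B at two biallelic SNPs.

    p_ab: frequency of the haplotype carrying both A and B
    p_a, p_b: frequencies of allele A and allele B
    """
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic")
    return d * d / denom


# Hypothetical example: two SNPs whose risk alleles almost always travel together.
r2 = r_squared(p_ab=0.29, p_a=0.30, p_b=0.32)
print(f"r^2 = {r2:.2f}")
```

An r² close to 1 means that genotyping either SNP gives nearly the same information, which is why two studies reporting different SNPs at one locus can still count as replication of the same signal.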

What defines a novel gene? 

In all cases, the investigators have taken great 
care to qualify their new findings by saying 
that they have identified robust 'signals' of 
association or 'gene regions' rather than
actual genes. There are two reasons for this. 
First, the correlation between common 
genetic variation (linkage disequilibrium) 
means that the best association that has been 
found so far might not represent the causal 
variant, or combination of variants. For 
example, in the haematopoietically expressed 
Figure 3 | Association statistics from one of the five type 2 diabetes
genome-wide association studies 20. The y axis represents the -log10
p value and the x axis represents each of the ~400,000 SNPs used in this
scan. The point of each arrow indicates the location of the most strongly
associated SNP in each of nine known type 2 diabetes gene regions.
Two signals, in SLC30A8 and TCF2, were not captured on the Affymetrix
chip. The plot was generated using Haploview. CDKAL1, CDK5 regulatory
subunit-associated protein 1-like 1; CDKN2, cyclin-dependent kinase
inhibitor 2A; FTO, fat mass and obesity-associated; HHEX,
haematopoietically expressed homeobox; IDE, insulin-degrading
enzyme; IGF2BP2, insulin-like growth factor 2 mRNA-binding protein 2;
KCNJ11, potassium inwardly-rectifying channel, subfamily J, member 11;
PPARG, peroxisome proliferator-activated receptor-γ gene; SLC30A8,
solute carrier family 30 (zinc transporter), member 8; TCF2, transcription
factor 2, hepatic; TCF7L2, transcription factor 7-like 2 (T-cell specific,
HMG-box).



homeobox (HHEX)-insulin-degrading
enzyme (IDE) region, strong linkage dis- 
equilibrium extends across three genes, two 
of which are plausible candidates. Second, 
the functional variant might not lie anywhere 
near the coding parts of a gene, but might 
instead influence a regulatory element that 
might or might not be characterized yet. The 
CDKN2A-CDKN2B locus, which contains
cyclin-dependent kinase inhibitor genes, 
seems to be an example of this; the associa- 
tion is some distance from the nearest genes. 

The uncertainty about where the 
causal variants lie does not detract from 
the strength of the associations; so, what 
strength of association convinced the 
teams that the associations they found were 
real? Much has been written about what 
constitutes a real finding, including a recent 
report from a working group that was set up 
specifically to address this particular ques- 
tion 23 . One of the main criteria is a p value 
of approximately <5×10⁻⁷. Although this
estimate was made on the basis of only a 
handful of genes so far, this value seems to be 
about right for type 2 diabetes. For example, 
in cases in which individual genome scans 
found evidence of association at p < 1×10⁻⁷,
the results held up: TCF7L2 and FTO are 
two examples. By contrast, many signals that 
reached nominal levels of p = 1×10⁻⁵ to 1×10⁻⁶
did not hold up when more samples were 



added, or at least have remained in the 
'further studies needed' category. This was 
the case for the variants in the exostoses 
(multiple) 2 (EXT2) and LOC387761 genes, 
which reached p values of 1.8×10⁻⁵ and
1.2×10⁻⁵, respectively, in the genome-scan
data reported by Sladek et al., but did not
hold up across all the studies. There is 
further evidence in the form of intermediate 
trait data — some of the variants are associ- 
ated with a diabetes-related phenotype in the 
general population (TABLE 1). This supports
the evidence that the associations represent 
genuine biological findings. Among the 
new signals identified by GWAS, the most 
convincing is the association of the FTO diabetes
risk allele with increased BMI (BOX 1)
in the general population, but there is also
evidence that the risk alleles in the CDK5
regulatory-subunit-associated protein
1-like 1 gene (CDKAL1) 16,19, HHEX-IDE
(L. Pascoe, E. Ferrannini & M. Walker, personal
communication) and the solute-carrier
family gene SLC30A8 (REF. 19) regions reduce
insulin secretion in healthy adults.

New associations lead to new aetiology 

Of the 11 type 2 diabetes gene regions, 4
were found by candidate-gene studies; of 
the remaining 7, none contains obvious 
candidate genes. The risk conferred by the 
individual gene variants is small, but this 



only reduces the potential predictive value of 
the risk alleles. A small impact on risk does 
not detract from the most important aspect 
of the findings — new associations, provided 
that they are reproducible, bring new knowledge
about disease aetiology. The results of
the type 2 diabetes GWAS implicate several
pathways involved in β-cell development
and function. 

Common variants in TCF7L2 emerged as 
one of the top signals, if not the top signal, in 
each study. This is reflected in the estimates 
of the odds ratios that are conferred by each 
additional risk allele carried (TABLE 1) at
each locus. Each T allele of the key TCF7L2 
SNP, rs7903146, increases disease risk with 
an odds ratio of 1.37. This is substantially 
higher than all of the other 10 gene variants, 
for which the odds ratios range from 1.10 to 
1.20. The claims that TCF7L2 represents, by 
some distance, the most important type 2 
diabetes gene are also reflected by the sample 
sizes needed to have adequate power to 
detect the effect. Approximately 1,380 cases
and 1,380 controls are needed to detect the 
TCF7L2 effect with 80% power at p = 5×10⁻⁷,
on the basis of allele frequencies in the UK 
population. The gene region with the next 
strongest effect, the CDKN2A-2B signal, 
requires 6,200 cases and 6,200 controls. It is 
still possible that there is a larger signal or 
a similarly sized signal that has so far not 
Table 1 | Details of 11 type 2 diabetes gene regions



evidence 
(p value)* 



KCNJ11
TCF7L2 



rs5215 
(E23K) 
rs7901695 

rs4430796
rs10010131



rs13266634 SLC30A8



Candidate 

Region-wide 

Candidate 
Candidate 



Monogenic + 
drug target 



Monogenic 
Monogenic 



Additional evidence Odds ratio RAF N* 
from human physiology (per allele)* (UK) 



disrupted 
pancreatic 
development 
Genome wide None 



rs10946398 CDKAL1 Genome-wide None



reduced islet 
proliferation 

Genome-wide Some — 

binds insulin- 
like growth 
factor mRNA 

Genome-wide None 



rs4402960 IGF2BP2



rs8050136 FTO 



Nothing 



Alters insulin secretion in
general population 
Alters insulin secretion in 
general population 
Nothing consistent 
Nothing consistent 

Early studies indicate 
altered insulin secretion 
in general population 



Early studies indicate 
altered insulin secretion 
in general population 

Early studies indicate 
altered insulin secretion 
in general population 
Nothing consistent 



1.14 (1.08-1.20) 0.87 >20,000
1.14 (1.10-1.19) 0.35 15,600
1.37 (1.31-1.43) 0.31 2,760
1.10 (1.07-1.14) 0.47 >20,000
1.11 (1.08-1.16) 0.60 >20,000
1.15 (1.10-1.19) 0.65 12,800
1.15 (1.12-1.19) 0.69 14,400 N/C
1.14 (1.11-1.17) 0.32 16,200 65
1.20 (1.14-1.25) 0.83 12,400 272
1.14 (1.11-1.18) 0.32 16,200 1,026
1.17 (1.12-1.22) 0.40 10,400



[…] ratios from REFS 16,17,20, except for the signals for HHEX-IDE, which also includes
data from REF. 18, CDKAL1, which includes data from all five GWAS studies, TCF2, which is based on data from […] and
WFS1, which is based on data from […]. […] allele frequencies and assuming
a 1:1 ratio to provide 80% power to detect an effect at p = 5×10⁻⁷, on the basis of UK risk
[…] genome-wide scan 20 in the list of 393,453 passing quality control and with minor allele
frequencies >1%. BMI, body mass index; CDKAL1, CDK5 regulatory subunit-associated protein 1-like 1;
CDKN2, cyclin-dependent kinase inhibitor 2A; FTO, fat mass and obesity-associated; GWAS, genome-wide
association study; HHEX, haematopoietically expressed homeobox; IDE, insulin-degrading enzyme;
IGF2BP2, insulin-like growth factor 2 mRNA-binding protein 2; KCNJ11, potassium inwardly-rectifying
channel, subfamily J, member 11; KO, knockout; N/C, not captured; PPARG, peroxisome proliferator-activated
receptor-γ gene; RAF, risk allele frequency; SLC30A8, solute carrier family 30 (zinc transporter),
member 8; TCF2, transcription factor 2, hepatic; TCF7L2, transcription factor 7-like 2 (T-cell specific, HMG-box).



been identified, because the chips that were 
used do not cover all of the common and 
little of the rare variation; however, it seems 
likely that common variation in TCF7L2 will 
represent the most important type 2 diabetes 
locus in terms of its risk effect and the 
frequency of the risk allele. Not a lot is known

about how TCF7L2 predisposes to type 2 
diabetes. It encodes a transcription factor 
that is expressed in the fetal pancreas and 
is involved in the WNT signalling pathway.
One of its targets is HHEX, which lies in 
one of the other six identified novel diabetes 
gene regions. The region within which
HHEX falls spans 295 kb and includes three
genes; the kinesin gene KIF11 lies
in between HHEX and IDE. HHEX, which
encodes a transcription factor with a key role 



in pancreatic development (knockout mice 
lack a ventral pancreas 24 ), seems to be the 
most likely candidate out of these three. The 
association of the diabetes risk allele in the 
HHEX-KIF11-IDE locus with reduced insu- 
lin secretion further strengthens the claims 
that the risk operates through altered β-cell
function (Pascoe, Ferrannini & Walker, 
personal communication). 
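
The sample sizes quoted earlier for TCF7L2 and CDKN2A-2B follow from a power calculation of the general kind sketched below. This is a rough illustration, not the method used for the published figures: it assumes a simple two-sided comparison of allele frequencies between equal numbers of cases and controls, a normal approximation, and that the tabulated UK risk allele frequency (RAF) applies to controls. The function names are illustrative. Under those assumptions the answers land in the same range as, but do not exactly reproduce, the quoted 1,380 and 6,200 per group.

```python
from math import ceil, sqrt
from statistics import NormalDist


def case_allele_freq(control_freq: float, odds_ratio: float) -> float:
    """Risk allele frequency in cases implied by a per-allele odds ratio."""
    odds = odds_ratio * control_freq / (1 - control_freq)
    return odds / (1 + odds)


def cases_needed(control_freq: float, odds_ratio: float,
                 alpha: float = 5e-7, power: float = 0.80) -> int:
    """Approximate number of cases (with an equal number of controls) required.

    Normal-approximation sample size for comparing two allele frequencies;
    each person contributes two alleles, hence the final division by 2.
    """
    p0 = control_freq
    p1 = case_allele_freq(p0, odds_ratio)
    p_bar = (p0 + p1) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p0 * (1 - p0) + p1 * (1 - p1))) ** 2
    alleles_per_group = numerator / (p1 - p0) ** 2
    return ceil(alleles_per_group / 2)


# Odds ratios and RAFs as given in the text and table for the two strongest signals.
tcf7l2_cases = cases_needed(control_freq=0.31, odds_ratio=1.37)
cdkn2_cases = cases_needed(control_freq=0.83, odds_ratio=1.20)
print(tcf7l2_cases, cdkn2_cases)
```

The point of the comparison is qualitative: a per-allele odds ratio of 1.37 needs several times fewer samples than one of 1.20 at the same genome-wide threshold.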

The association signal on chromosome 
9 lies ~120 kb from the 3' end of CDKN2B,
which lies next to its close relative CDKN2A. 
Interestingly, recent GWAS have identified a 
separate set of SNPs that seem to represent 
the strongest common genetic risk factor for 
heart disease (myocardial infarction) 21,25,26.
There is no correlation between the diabetes 
signal and the heart disease signal, but the 



latter does fall closer to the CDKN2 genes. 
CDKN2A, which encodes p16INK4a, over-
expression of which leads to decreased islet
proliferation in ageing mice 27, is the most
likely candidate for type 2 diabetes. Initial
human physiology studies have not provided
any evidence that the risk alleles alter insulin
secretion, but the mouse phenotype strongly
implicates β-cell dysfunction.

The association of SNPs in CDKAL1 
ranked near the top in four of the five 
scans. Little is known about CDKAL1, but 
it is highly expressed in human islets 20 . 

It shares similarity with the CDK5
regulatory-subunit-associated protein 1
gene (CDK5RAP1), a known inhibitor
of CDK5 activation. CDK5 is implicated
in reduced β-cell function, through the
formation of p35-CDK5 complexes, which
[…]. The

association between the diabetes risk allele 
in the CDKAL1 locus and reduced insulin 
secretion further strengthens the claims that
altered β-cell function underlies this risk 15 '.

Less is known about the remaining
associated regions. Insulin-like growth
factor 2 mRNA-binding protein 2 (IGF2BP2)
binds to the key growth and insulin signalling
molecule insulin-like growth factor II
(IGFII) and is also expressed in the pancreatic
islet 3 '. SLC30A8 is a zinc transporter that is
expressed in the β-cell. Of the six new gene
regions, the variant in SLC30A8 that is associated
with diabetes risk is the only one that
has an obvious functional explanation: it is
non-synonymous, changing an arginine to a
tryptophan.

FTO is the most mysterious of all. A multi- 
gene deletion in a mouse model that resulted 
in a fused-toe phenotype gave the gene its 
original name 3 ", but to date the best clue to 
its role in obesity in humans is its expression 
in the hypothalamus, the key part of the
brain that influences appetite. 

Finally, deCODE genetics recently showed 
that the common type 2 diabetes risk alleles 
in TCF2 also protect men from prostate 
cancer (p = 1x10-") (REF. 8). This raises the 
intriguing possibility that different alleles 
at the same locus could predispose to cell 
overgrowth on the one hand and, perhaps,
cell degeneration or reduced cell turnover, for
example, of the β-cell, on the other hand.

What next? 

The five type 2 diabetes GWAS make it clear 
that human genetic studies are entering a 
new era. This is almost certainly the case
for most common diseases for which results
from several GWAS are, or soon will be, in
the public domain. Researchers have gone
from being able to analyse, at best, a few 



Box 1 | FTO gene variants alter fat mass in the general population

Changing lifestyles over the past few decades have caused a rapid increase in the mean body
mass index (BMI) in most developed countries. This has had an enormous impact on obesity-
related illnesses, including cardiovascular disease, hypertension and possibly even cancer and
depression, in addition to type 2 diabetes.

One of the most exciting findings to have come from the genome-wide association studies 
(GWAS) therefore was that fat mass and obesity associated (FTO) genotypes predisposed to 
type 2 diabetes by altering BMI. The importance of this was highlighted by a separate article
that described an association of FTO with BMI and obesity risk in the general population 22.
Using ~30,000 adults, the investigators found that the 16% of the European population
carrying two copies of the diabetes risk allele were ~1.0 kg m⁻² or 2.3 kg heavier than the
35% carrying two copies of the non-risk allele. The statistical confidence behind these findings 
(p = 5xl0^ 5 ) and its replication in a further study of approximately 8,000 people (p = 2x10-"') 
(REF. 33) make this by far the most convincing evidence for a common gene variant that alters 
BMI. Even more striking was the finding, using data from more than 5,000 children aged 9 
years, that the association was with fat mass, with little or no effect on lean mass 22 . 

The public health message has been that everybody should eat less and exercise more to 
reduce their fat mass. The above finding has not changed this message, but does emphasize that 
some people will find it harder to attain an optimal weight in today's environment than others. 
Understanding how FTO alters fat mass is likely to increase greatly our understanding of obesity. 



thousand polymorphisms in a few hundred 
genes, to more than 80% of the common 
SNPs in the genome, making it difficult to 
justify case-control studies that do not have 
a genome-wide scope. Researchers will soon 
be able to simply look up the case-control 
results on the web. 

An important next step will be for the 
geneticists to pass the baton to their clini- 
cal, cellular, animal and molecular biology 
colleagues so that they can work out the 
mechanisms behind the associations between 
the new gene variants and disease. But does 
this mean that human geneticists will be out 
of the competition? Far from it, there are 
many areas that need to be addressed. Below 
are some examples, most of which will apply 
equally to other diseases. 

First, it will be important to fine-map 
the new type 2 diabetes gene regions. One 
of the first steps will involve deep sequencing 
and further rounds of genotyping to build 
up a full picture of all the possible common
variants that might explain the association
signals. This should include efforts to define
copy number variants such as duplications 
and deletions, and should also attempt to 
define independent associations in the 
same gene regions. The confirmation that a 
region is involved in disease makes it much 
more likely that additional variation in the 
region will predispose to disease. The extent 
to which African populations, which show 
reduced linkage disequilibrium, will help to 
fine-map these regions remains to be seen. 

Second, further association studies of the 
new variants are needed. Investigators will 
need to assess their role in other popula- 
tions, especially populations with a high 
prevalence of diabetes, such as South-Asian, 
African-American and Mexican-American
populations. Further studies of the role of 
risk alleles in the general population are also 
important. Type 2 diabetes is associated with 
many traits, including reduced insulin secretion,
insulin resistance, birth weight and
inflammation. Preliminary data show that
some of the type 2 diabetes risk alleles are 
associated with reduced insulin secretion in 
healthy individuals, but more human physi- 
ological studies of this type are needed to 
assess the role of the variants in other traits. 

Third, geneticists need to assess the predictive
value of the disease variants. Given
the small effect sizes that have been seen so
far, individual variants will not be useful in
predicting type 2 diabetes. However, more
extensive studies are needed to assess the 
usefulness of combining information from 
multiple variants. The best way of addressing 
this question is with prospective studies. 

Fourth, the potential pharmacogenetic
role of new variants needs to be assessed.
For many years, the holy grail of common
disease genetics has been personalized treatment.
Recent data suggest that TCF7L2 risk
alleles predict treatment response to one
type of anti-diabetic treatment (sulphonylureas) 31 ,
representing a genuine example of
pharmacogenetics, although it is too early to
say whether this will be clinically relevant.

Finally, there are more genes to be found. 
The current findings explain only a small 
proportion of the excess familial risk. The 
positions of the four variants found from 
candidate-gene studies in genome-wide data 
tell a tantalizing story. Common variants 
underlying the PPARG, KCNJ11 and WFS1 
associations occurred at positions 786, 799 
and 26,017, respectively, in the UK-based 
genome-wide scan of 1,924 cases and 2,938
controls, and the TCF2 risk variants were 
not captured by the Affymetrix chip (TABLE 1 ). 
Because each of the GWAS followed up on 
only a few of the top signals, there must be 
many more signals to uncover. Extensive 
collaboration between groups to assemble 
sample sizes of an order of magnitude of tens 
of thousands of cases and controls will be 
necessary to hunt down risk alleles with odds 
ratios of ~1.10 or less, but which are of no
less importance from an aetiological point of
view. This is possible. Three of the five type 2
diabetes studies set a benchmark by sharing
their summary statistics before publication 16,17,20 .
These studies identified the three
novel regions found by the other two studies
(SLC30A8, HHEX/IDE and CDKAL1) but
the IGF2BP2 and CDKN2A/2B regions were
revealed only through combining genome-wide
data from the total of ~4,500 cases and
~5,500 controls. The study that went on to
describe the effect of the FTO gene in the
general population used a total of ~39,000

samples; a recent study of breast cancer used 

ntrols to 
ide 1 ' \ nsk variants 



Conclusion 

The first wave of type 2 diabetes GWAS has 
provided convincing evidence for six novel 
gene regions involved in the condition. In 
many cases, the genes in these regions were
unexpected; by demonstrating their association, the studies
uncovered new aetiological pathways to
be explored. This should ultimately lead
to improved prevention and treatment for
patients. The total number of type 2 diabetes-
associated gene regions is now 11, but many
more, probably of small effect, remain to be
found. Now that sample sizes are in place and
technology is not prohibitively expensive,
the best way of identifying additional genes will
be through collaboration to assemble sample
sizes of >10,000 cases and controls.

Genome-wide association (GWA) studies have identified
multiple loci at which common variants modestly but
reproducibly influence risk of type 2 diabetes (T2D) 1-11 .
Established associations to common and rare variants explain
only a small proportion of the heritability of T2D. As
previously published analyses had limited power to identify
variants with modest effects, we carried out meta-analysis of
three T2D GWA scans comprising 10,128 individuals of
European descent and ~2.2 million SNPs (directly genotyped
and imputed), followed by replication testing in an
independent sample with an effective sample size of up to
53,975. We detected at least six previously unknown loci
with robust evidence for association, including the JAZF1
(P = 5.0 × 10⁻¹⁴), CDC123-CAMK1D (P = 1.2 × 10⁻¹⁰),
TSPAN8-LGR5 (P = 1.1 × 10⁻⁹), THADA (P = 1.1 × 10⁻⁹),
ADAMTS9 (P = 1.2 × 10⁻⁸) and NOTCH2 (P = 4.1 × 10⁻⁸)
gene regions. Our results illustrate the value of large discovery
and follow-up samples for gaining further insights into the
inherited basis of T2D.

GWA studies are unbiased by previous hypotheses concerning candi- 
date genes and pathways, but they are limited by the modest effect 
sizes of individual common susceptibility variants and the need for 
stringent statistical thresholds. For example, the largest allelic odds 
ratio (OR) of any established common variant for T2D is ~ 1.35 
(TCF7L2), and the nine other validated associations to common 
variants (excluding FTO, which has its primary effect through obesity) 
have allelic ORs between 1.1 and 1.2 (refs. 1-6,11,12). To augment 
power to detect additional loci of similar or smaller effect, we 
increased sample size by combining three previously published 
GWA studies (Diabetes Genetics Initiative (DGI), Finland-United 
States Investigation of NIDDM Genetics (FUSION) and Wellcome
Trust Case Control Consortium (WTCCC)) 1-4 , and extended SNP
coverage by imputing untyped SNPs on the basis of patterns of 
haplotype variation from HapMap 13 (Table 1). 

We started with a set of genotyped autosomal SNPs that passed 
quality control filters in each study: in WTCCC, 393,143 SNPs from 
the Affymetrix 500K chip (minor allele frequency (MAF) > 0.01; 
1,924 cases and 2,938 population-based controls 3,4 ); in DGI, 378,860
SNPs from the Affymetrix 500K chip (MAF > 0.01; Swedish and 
Finnish sample of 1,464 T2D cases and 1,467 normoglycemic 
controls, including 326 discordant sibships 1 ); and in FUSION, 
306,222 SNPs from the Illumina 317K chip (MAF > 0.01, 1,161 
T2D cases and 1,174 normal glucose-tolerant controls from Finland 2 )
(Supplementary Table 1 online). 44,750 SNPs (MAF > 0.01) 
were directly genotyped in all three studies across the two platforms. 
We used data from the GWA studies and phased chromosomes from 
the HapMap CEU sample to impute autosomal SNPs with MAF > 
0.01 (ref. 14; see also URLs section in Methods). We based our 
further analyses on 2,202,892 SNPs that met imputation and geno- 
typing quality control criteria across all studies (Supplementary 
Methods online). 

Using these directly measured and imputed genotypes, we tested for 
association of each SNP with T2D in each study separately, corrected 
each study for residual population stratification, cryptic relatedness or 
technical artifacts using genomic control, and then combined these 
results in a genome-wide meta-analysis across a total of 10,128 
samples (4,549 cases and 5,579 controls; Supplementary Methods). 
We calculated that this sample size provides reasonable power to 
detect additional variants with properties similar to those previously 
identified through less formal data combination efforts 1,2,4 (Supple- 
mentary Table 2 online). Unless otherwise indicated, results presented 
are derived from individually genomic control-adjusted stage 1 
results. We obtained meta-analysis OR and confidence intervals

Table 1 Overview of study design

Study            Cases (n) a   Controls (n) a   Effective sample size a   Number of directly genotyped SNPs b   Number of imputed SNPs b
Stage 1
DGI              1,464         1,467            2,521                     378,860                               1,888,145
WTCCC            1,924         2,938            4,706                     393,143                               1,915,393
FUSION           1,161         1,174            2,335                     306,222                               2,110,199
DGI stage 2      5,065         5,785            9,874                     63
FUSION stage 2   1,215         1,258            2,473                     59
UK stage 2       3,757         5,346            9,114                     66


Stage 3
deCODE, KORA, HUNT, EPIC, ADDITION/Ely
5,043 1,503 2,014 1,610 2,400 2,639 0(3,130) 2,684 8,690 2,412 3,468 1,070 1,036

a [...] stage 3 study, we used genotype data from the Icelandic GWA scan 5 for rs2641348, rs7578597 and rs9472138, and a perfect proxy (rs2793831, based [...] remaining SNPs had not been directly typed as part of this scan and were therefore genotyped separately, [...] a subset of the GWA scan samples (numbers indicat[...] (Supplementary Methods). b Autosomal SNPs passing quality control, as defined for directly genotyped and imputed SNPs [...] each study (quality control criteria: SNPTEST inform[...] >0.5; [...] > 0.3; MAF > 0.01). For the st[...] results for 2,202,892 directly genotyped and imputed SNPs passing quality control in all three studies.


from a fixed-effects model, and P values from a weighted z statistic-based
meta-analysis (Supplementary Methods). As expected, the
most significant result was obtained for rs7903146 in TCF7L2. We
also observed evidence for association (P < 10⁻³) at eight of the ten
established T2D loci (as well as at the FTO obesity locus) 12 (Supplementary
Table 3 online). This was unsurprising, as these same data
supported the identification of many of these loci. As our goal was to
identify previously unknown loci, we excluded 1,981 SNPs in the
immediate vicinity of these T2D susceptibility loci from further
analysis (with the exception of a signal near PPARG, which was
followed up), and examined the remainder of the autosomal genome
(Supplementary Methods). Even after excluding known loci, we saw a
strong enrichment of highly associated variants: 426 with P values
< 10⁻⁴, compared to 217 under the null.

Before proceeding to follow-up, we explored the individual studies 
and the combined data for potential errors and biases. We found a 
genomic control λ value of 1.04 for the combined results (based on
10,128 samples), which, given the relationship between λ and sample
size 13 , suggests little residual confounding (Supplementary Fig. 1 and
Supplementary Note online). We also used genome-wide genotype 
data to estimate the principal components of the identity-by-state 
relationships in each stage 1 sample. For the SNPs presented in 
Table 2, adjustment for principal components in stage 1 T2D 
association analysis did not diminish the association in the WTCCC 
(two principal components), FUSION (ten principal components) or 
DGI (ten principal components) samples (Supplementary Note). 
Additionally, we did not find any evidence for association between 
UK population ancestry informative markers 3 and disease status in the
UK replication sets (Supplementary Note). To ensure that the
observed stage 1 associations taken forward to follow-up were not 
due to imputation errors, we directly genotyped originally imputed 
variants in the stage 1 samples (Supplementary Methods). We found 
strong agreement between the genotype-based and imputed P values: 
in 38 of 43 cases where a direct genotype-based result was obtained, 
the P value was within one order of magnitude of that derived from 
imputation, and in the remaining five cases, P values were less than 
two orders of magnitude different (Supplementary Table 4 online). 

We selected SNPs for replication principally on the basis of the 
statistical evidence for association in stage 1, excluding SNPs with 
evidence for heterogeneity of ORs (P < 10⁻⁴) across studies (Supplementary
Methods). We took 69 SNPs forward to an initial round of
replication (stage 2) in up to 22,426 additional samples of European 
descent (Table 1 and Supplementary Table 1). The distribution of 
association P values in stage 2 was highly inconsistent with a null 
distribution. Of the 69 signals selected for follow up, 65 were 
successfully genotyped in stage 2, and represented loci that were 
independent of each other and of previously established susceptibility 
loci. Nine of these had a P value <0.01 with association in the same 
direction as the original signal, far in excess of the 0.33 expected under 
the null (P = 1.4 × 10⁻¹², binomial test; Supplementary Methods),
and two SNPs had P < 10⁻⁴ as compared to an expectation of 0.0033
(P = 5.2 × 10⁻⁶) (Supplementary Methods and Supplementary
Table 5 online). 

We identified 11 SNPs (ten separate signals, nine of which represent
previously unknown loci) with P < 0.005 in stage 2 for which the 
combined stage 1 and stage 2 data (based on direct genotyping of stage

Figure 1 Regional plots of six confirmed associations. (a-f) For each of the JAZF1 (a), CDC123-CAMK1D (b), TSPAN8-LGR5 (c), THADA (d), ADAMTS9 (e)
and NOTCH2-ADAM30 (f) regions, genotyped and imputed SNPs passing quality control across all three stage 1 studies are plotted with their meta-analysis
P values (as -log10 values) as a function of genomic position (NCBI Build 35). In each panel, the SNP taken forward to stages 2 and 3 is represented by a
blue diamond (meta-analysis P value across stages 1-3), and its initial P value in stage 1 data is denoted by a red diamond. Estimated recombination rates
(taken from HapMap) 13 are plotted to reflect the local LD structure around the associated SNPs and their correlated proxies (according to a white to red
scale from r² = 0 to r² = 1; based on pairwise r² values from HapMap CEU) 13 . Gene annotations were taken from the University of California Santa Cruz
Genome Browser.



1 samples, where previously imputed) generated P < 10⁻⁵. We further
genotyped these 11 SNPs in up to 57,366 additional samples (14,157
cases and 43,209 controls) of European descent in stage 3 (Table 1,
Supplementary Table 1 and Supplementary Methods). The distribution
of P values for these 11 SNPs was again inconsistent with a null
distribution: all nine newly identified and independent SNPs had
effects in the same direction as in the stage 1 + 2 meta-analysis
(P = 0.002), and seven had P < 0.05 in the direction of the original
association (P = 2.1 × 10⁻¹⁰) (Table 2).
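
The two stage 3 probabilities quoted above can be checked with an exact binomial tail. The sketch below is illustrative only: it assumes the nine SNPs are independent, that each direction of effect is equally likely under the null (probability 0.5), and that "P < 0.05 in the direction of the original association" has null probability 0.05 / 2 per SNP. The authors' own calculation is described in their Supplementary Methods and may differ in detail; the function name is hypothetical.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# All nine independent SNPs share the direction seen in stages 1 + 2.
# Assumption: under the null, each direction has probability 0.5.
p_same_direction = binomial_upper_tail(9, 9, 0.5)

# Seven of nine SNPs reach two-sided P < 0.05 in the original direction.
# Assumption: under the null, this happens with probability 0.05 / 2 per SNP.
p_seven_of_nine = binomial_upper_tail(7, 9, 0.025)

print(f"{p_same_direction:.3f}")  # 0.002
print(f"{p_seven_of_nine:.1e}")   # 2.1e-10
```

Under these assumptions the sketch gives values that round to the reported P = 0.002 and P = 2.1 × 10⁻¹⁰.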

On the basis of the combined stage 1-3 analyses, we found that six 
signals reached compelling levels of evidence (P = 5.0 × 10⁻⁸ or
better) for association with T2D (Table 2). As in all linkage disequili- 
brium (LD)-mapping approaches, characterization of the causal 
variants responsible, their effect sizes and the genes through which 
they act will require extensive resequencing and fine-mapping. How- 
ever, on the basis of current evidence, we found that the most 
associated variants in each of these signals map to intron 1 of 
JAZF1, between CDC123 and CAMK1D, between TSPAN8 and 
LGR5, in exon 24 of THADA, near ADAMTS9 and in intron 5 
of NOTCH2. 

The strongest statistical evidence for a new association signal was 
for rs864745 in intron 1 of JAZF1 (Fig. 1), one of a cluster of 
associated SNPs with strong evidence for association in the stage 1 
meta-analysis and across each replication sample (Table 2 and 
Supplementary Table 6 online). The overall estimate of effect was 
an OR of 1.10 (95% CI = 1.07-1.13; P = 5.0 × 10⁻¹⁴ under an
additive model), based on 68,042 individuals. JAZF1 (juxtaposed with
another zinc finger gene 1) encodes a transcriptional repressor of
NR2C2 (nuclear receptor subfamily 2, group C, member 2) 16 . Mice
deficient in Nr2c2 show growth retardation, low IGF1 serum concentrations
and perinatal and early postnatal hypoglycaemia 17 . Very
recently, a SNP in JAZF1 was identified as associated with prostate
cancer 18 ; this is particularly interesting given the recent finding
that SNPs in HNF1B are also associated both with T2D and
prostate cancer 19,20 .

The second strongest statistical evidence for a new signal was for 
rs12779790 (combined OR = 1.11, 95% CI = 1.07-1.14, P = 1.2 ×
10⁻¹⁰), which lies in an intergenic region ~90 kb from CDC123 (cell
division cycle 123 homolog (S. cerevisiae)) and ~63.5 kb from 
CAMK1D (calcium/calmodulin-dependent protein kinase ID) 
(Fig. 1, Table 2 and Supplementary Table 6). CDC123 is regulated 
by nutrient availability in S. cerevisiae and has a role in cell cycle 
regulation 21 . Evidence from previous GWA studies implicating var- 
iants in CDKAL1 and near CDKN2A/B in T2D predisposition suggests 
that cell cycle dysregulation may be a common pathogenetic 
mechanism in T2D 1,2,4 .

The third strongest statistical signal was found for rs7961581, which 
resides upstream of TSPAN8 (tetraspanin 8; combined OR = 1.09, 
95% CI = 1.06-1.12, P = 1.1 × 10⁻⁹) (Fig. 1, Table 2 and
Supplementary Table 6). Tetraspanin 8 is a cell-surface glycoprotein 
expressed in carcinomas of the colon, liver and pancreas. 

The fourth strongest new association signal was found for 
rs7578597, a nonsynonymous SNP (T1187A; combined OR = 1.15, 
95% CI = 1.10-1.20, P = 1.1 × 10⁻⁹) that resides in exon 24 of the
widely expressed THADA (thyroid adenoma associated) gene (Fig. 1, 
Table 2 and Supplementary Table 6). Disruption of THADA by 
chromosomal rearrangements (including fusion with intronic 
sequence from PPARG) is observed in thyroid adenomas 22 . The 
function of THADA has not been well characterized, but there is 
some evidence to suggest it may be involved in the death receptor 
pathway and apoptosis 23 . 

rs4607103 (combined OR = 1.09, 95% CI = 1.06-1.12, P = 1.2 × 10⁻⁸),
representing a cluster of associated SNPs, resides ~38 kb upstream
of ADAMTS9 (ADAM metallopeptidase with thrombospondin
type 1 motif, 9), and is the SNP with the fifth strongest signal (Fig. 1,
Table 2 and Supplementary Table 6). ADAMTS9 is a secreted 
metalloprotease that cleaves the proteoglycans versican and aggrecan, 
and it is expressed widely, including in skeletal muscle and pancreas. 

The sixth strongest signal is marked by rs10923931, which resides
within intron 5 of NOTCH2 (Notch homolog 2 (Drosophila); combined
OR = 1.13, 95% CI = 1.08-1.17, P = 4.1 × 10⁻⁸) (Fig. 1,
Table 2 and Supplementary Table 6). We also followed up on
rs2641348, a nonsynonymous SNP (L359P) within the neighboring
gene ADAM30 (ADAM metallopeptidase domain 30) that represents
the same signal (r² = 0.92 based on HapMap CEU data), but we found
that its overall signal (combined OR = 1.10, 95% CI = 1.06-1.15,
P = 4.0 × 10⁻⁷; Table 2) was slightly weaker. NOTCH2 is a type 1
transmembrane receptor; in mice, Notch2 is expressed in embryonic
ductal cells of branching pancreatic buds during pancreatic organogenesis,
the likely source of endocrine and exocrine stem cells 24 .

The strength of the association evidence for the remaining four
variants taken into stage 3 did not meet our prespecified threshold of
P < 5.0 × 10⁻⁸. However, it is likely (based on individual significance
values and their overall distribution) that several of these variants also
represent genuine association signals. In all, three of these additional
SNPs showed P values < 10⁻⁵ across the combined data (Table 2), and
two had P < 0.05 in stage 3 in the same direction as in stages 1 and 2.
Variants near DCD (dermcidin) showed evidence for association
(rs1153188; overall P = 1.8 × 10⁻⁷) (Supplementary Fig. 2 online).
A signal in VEGFA had previously been noted in the WTCCC GWA
scan 4 , but it showed inconsistent evidence for replication: further
studies will be required to establish its status. We also found association
at rs17036101, ~44 kb downstream of SYN2 (synapsin II) and
115.3 kb upstream of the established T2D susceptibility variant
rs1801282 (P12A) in PPARG (r² = 0.54 in HapMap CEU) (Supplementary
Fig. 3 online). Conditional analyses in stage 1 + 2 samples
could not differentiate between the effect of these two SNPs (Supplementary
Note and Supplementary Table 7 online).

None of the 11 SNPs (Table 2) were convincingly associated with
body mass index (BMI) (Supplementary Table 8 online) or other
T2D-related traits (with P < 10⁻³) (Supplementary Table 9 online).
The largest fold-change in T2D association P values before and after
adjusting for BMI was for rs17036101 (P = 8.1 × 10⁻⁵ before
adjustment and P = 7.5 × 10⁻⁶ after adjustment for BMI; Supplementary
Table 10 online). Conditioning on the associated SNP that
was taken forward to stages 2 and 3 in each region showed no
additional independent association signals (P < 10⁻⁴) in stage 1
data (Supplementary Note and Supplementary Fig. 4 online).

By combining three GWA scans involving 10,128 samples 
(enhanced through imputation approaches) and undertaking large- 
scale replication in up to 79,792 additional samples, we identified six 
additional loci that apparently harbor common genetic variants 
influencing susceptibility to T2D. These findings are consistent with 
a model in which the preponderance of loci detectable through the 
GWA approach (using current arrays and indirect LD mapping) have 
modest effects (ORs between 1.1 and 1.2). Given such a model, our 
study (in which we followed up only 69 signals out of over 2 million 
meta-analysed SNPs) would be expected to recover only a subset of the 
loci with similar characteristics (that is, those that managed to reach 
our stage 1 selection criteria). Further efforts to expand GWA meta- 
analyses and to extend the number of SNPs taken forward to large- 
scale replication should confirm additional genomic loci, as should 
targeted analysis of copy number variation. However, the present data 
provide only crude estimates of the overall effect on susceptibility 
attributable to variants at these loci. The effect of the actual common
causal variants responsible for the association, once identified,
will typically be larger, and many of these loci are likely to carry
additional causal variants, including, on occasion, low-frequency 
variants of larger effect: three genes with common variants that 
influence risk of T2D were first identified on the basis of rare 
mendelian mutations (in KCNJ11, WFS1 and HNF1B). Regardless of
effect size, these loci provide important clues to the processes involved 
in the maintenance of normal glucose homeostasis and in the 
pathogenesis of T2D. 

METHODS 

Stage 1 samples, genome-wide genotyping and quality control. An expanded 
description of these methods is provided in Supplementary Methods. 

The WTCCC stage 1 sample consists of 1,924 T2D cases and 2,938 
population controls from the UK 3,4 . These samples were genotyped on the
Affymetrix GeneChip Human Mapping 500K Array Set. The call frequency of
included samples was >0.97. In total, 393,143 autosomal SNPs passed quality
control criteria (Hardy-Weinberg equilibrium (HWE) P > 10⁻⁴ in T2D cases
and controls; call frequency >0.95, MAF > 0.01 and good clustering 3,4 ).

The DGI stage 1 Swedish and Finnish sample consists of 1,464 T2D cases 
and 1,467 normoglycemic controls. Of these, 2,097 are population-based T2D 
cases and controls matched for body mass index (BMI), gender and geographic 
origin, and 834 are T2D cases and controls in 326 sibships discordant for T2D 1 . 
These samples were genotyped on the Affymetrix GeneChip Human Mapping 
500K Array Set, and all included samples had a genotype call rate > 0.95. In 
total, 378,860 autosomal SNPs passed quality control criteria (call frequency
>0.95, HWE P > 10⁻⁶ in controls and MAF > 0.01 in both population and
familial components) 1 . 

The FUSION stage 1 sample consists of 1,161 Finnish T2D cases and 1,174 
Finnish normal glucose-tolerant controls 2 . In addition, we included 122 
FUSION offspring with genotyped parents for quality control purposes and 
quantitative trait analysis. Samples were genotyped with the Illumina HumanHap300
BeadChip (v1.1). All samples included had a call frequency >0.975. In
sum, 306,222 autosomal SNPs passed quality control (HWE P > 10⁻⁶ in the
total sample, < 3 combined duplicate or non-mendelian inheritance errors (out
of 79 duplicate samples and 122 parent-offspring sets), call frequency ≥0.90
and MAF > 0.01) (ref. 2).

Analysis of stage 1 genotype data. In combining data across the three studies, 
we did not attempt, given differences in study design and implementation, to 
harmonize every aspect of individual study analysis and quality control. For the 
UK, DGI and FUSION studies, respectively, 393,143, 378,860 and 306,222 SNPs 
were analyzed under an additive model. The genomic control values for these 
directly genotyped SNPs were 1.08 (UK), 1.06 (DGI) and 1.03 (FUSION) 
(Supplementary Methods). 
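
As a rough illustration of the genomic control values quoted above, the sketch below estimates λ as the median of observed 1-d.f. chi-square association statistics divided by the median expected under the null, then rescales the statistics. This is the standard textbook form of genomic control, not necessarily the authors' exact procedure (see their Supplementary Methods). The function names, the toy statistics and the choice not to rescale when λ is below 1 are assumptions made for illustration.

```python
from statistics import NormalDist, median

# Median of a 1-d.f. chi-square distribution (about 0.455), obtained as the
# square of the standard normal quantile at 0.75.
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def genomic_control_lambda(chi2_stats):
    """Inflation factor: observed median statistic over the null median."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN


def apply_genomic_control(chi2_stats):
    """Divide each statistic by lambda; assumption: never inflate (lambda >= 1)."""
    lam = max(genomic_control_lambda(chi2_stats), 1.0)
    return lam, [stat / lam for stat in chi2_stats]


# Hypothetical statistics built so that the median is inflated by 8%,
# mirroring the lambda of 1.08 reported for the UK scan.
toy_stats = [CHI2_1DF_MEDIAN * 1.08 * f for f in (0.25, 0.5, 1.0, 2.0, 4.0)]
lam, adjusted = apply_genomic_control(toy_stats)
print(round(lam, 2))
```

After adjustment the median of the rescaled statistics equals the null median, which is the property genomic control is designed to restore.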

Stage 1 imputation and T2D analysis. For each stage 1 sample set, we imputed 
genotypes for autosomal SNPs that were present in HapMap Phase II but that 
were not present on the genome-wide chip or that did not pass direct
genotyping quality control. In each sample, genotypes were imputed using 
the genotype data from the GWA chips and phased HapMap II genotype data 
from the 60 CEU HapMap founders. We retained SNPs that had an estimated
MAF > 0.01 in the control or total sample. Imputed SNPs were then tested for
T2D association. The genomic control values for these imputed SNPs were 1.08 
(UK), 1.07 (DGI) and 1.04 (FUSION) (Supplementary Methods). 

Stage 1 meta-analysis. An expanded description of these methods is provided 
in Supplementary Methods. We used meta-analysis to combine the T2D 
association results for the stage 1 WTCCC, DGI and FUSION samples. The 
combined stage 1 data are comprised of 10,128 samples: 4,549 T2D cases and
5,579 controls. We used association results for directly genotyped SNPs,
where available, and imputed genotype association results at all other positions.
2,202,892 genotyped and imputed autosomal SNPs passed quality control and 
had MAF > 0.01 in each of th th I i < 

three samples, 308 6 5 c I pi SO were genotyped 

in one sample, and 1,599,177 were imputed in all samples). All association
results were expressed relative to the forward strand of the reference genome
based on dbSNP125. In our initial analysis, which was used to select signals for
stage 2 genotyping, for each SNP we combined the ORs for a given reference
allele weighted by the confidence intervals using a fixed effects model. We
tested for heterogeneity of ORs using two commonly used
statistics: Cochran's Q statistic and I² (ref. 25).
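
A minimal sketch of the OR-based combination described above follows. It reads "weighted by the confidence intervals" as inverse-variance weighting on the log-OR scale, with each standard error recovered from the width of a 95% confidence interval; that reading is an assumption, as are the function name and the input values, which are hypothetical rather than taken from the study.

```python
import math


def fixed_effects_meta(odds_ratios, ci_lowers, ci_uppers, z_crit=1.96):
    """Inverse-variance fixed-effects meta-analysis with Cochran's Q and I^2."""
    betas = [math.log(o) for o in odds_ratios]
    # Standard error of each log-OR, recovered from the 95% CI width.
    ses = [(math.log(hi) - math.log(lo)) / (2 * z_crit)
           for lo, hi in zip(ci_lowers, ci_uppers)]
    weights = [1.0 / se**2 for se in ses]
    total_weight = sum(weights)

    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se = math.sqrt(1.0 / total_weight)

    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (b - beta) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    # I^2: percentage of variation attributed to heterogeneity, floored at 0.
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

    return {
        "or": math.exp(beta),
        "ci": (math.exp(beta - z_crit * se), math.exp(beta + z_crit * se)),
        "q": q,
        "i2": i2,
    }


# Hypothetical per-study ORs and 95% CIs for one reference allele.
result = fixed_effects_meta([1.12, 1.08, 1.15], [1.02, 1.00, 1.03], [1.23, 1.17, 1.28])
print(result)
```

Under the null of no heterogeneity, Q follows approximately a chi-square distribution with (number of studies - 1) degrees of freedom, which is how a heterogeneity P value would be obtained from it.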

We repeated the meta-analysis, combining evidence for association solely on 
the basis of the P values. Specifically, for each study, we converted the two-sided 
P value to a z statistic that was signed to reflect the direction of the association 
given the reference allele. Each z score was then weighted; the squared weights 
were chosen to sum to 1, and each sample-specific weight was proportional to
the square root of the effective number of individuals in the sample. We
summed the weighted z statistics across studies and converted the summary
z score to a two-sided P value.
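
The P value-based combination described in this paragraph can be sketched as follows. The weights are the square roots of each study's share of the total effective sample size, so the squared weights sum to 1. The effective sample sizes are the stage 1 values from Table 1; the P values, directions and function name are hypothetical and used only for illustration.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def weighted_z_meta(p_values, directions, effective_ns):
    """Combine two-sided P values via signed, sample-size-weighted z scores.

    directions: +1 or -1 per study, the sign of the effect for the reference allele.
    """
    total_n = sum(effective_ns)
    # Weights proportional to sqrt(effective n); squared weights sum to 1.
    weights = [sqrt(n / total_n) for n in effective_ns]
    # Two-sided P -> |z| through the lower-tail quantile, then attach the sign.
    z_scores = [d * -_STD_NORMAL.inv_cdf(p / 2) for p, d in zip(p_values, directions)]
    z_combined = sum(w * z for w, z in zip(weights, z_scores))
    p_combined = 2 * _STD_NORMAL.cdf(-abs(z_combined))
    return z_combined, p_combined


# Effective sample sizes for DGI, WTCCC and FUSION (Table 1); P values are made up.
z, p = weighted_z_meta([0.004, 0.0007, 0.03], [+1, +1, +1], [2521, 4706, 2335])
print(z, p)
```

Because the direction of effect enters through the sign of each z score, studies pointing in opposite directions partly cancel, which a method that combines unsigned P values would not capture.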

SNP prioritization for stage 2 genotyping. We prioritized 69 SNPs for
replication in stage 2 on the basis of the results from the three-study stage 1
meta-analysis, using a set of criteria we developed as part of a heuristic
approach to the prioritization of loci for follow-up (Supplementary Methods).
We considered SNPs with a meta-analysis P value < 10⁻⁴ and a meta-analysis
heterogeneity P value > 10⁻⁴. These selections were largely made using the
initial OR-based version of the meta-analysis. We allowed some exceptions to
the above follow-up criteria.

Five SNPs were selected for replication on the basis of their
strong association with T2D in the DGI GWA study (two SNPs), association
with T2D and with insulinogenic index in the DGI study (one SNP), and
overlap with FUSION or WTCCC (P < 0.05 in DGI and one or both studies;
two SNPs). For known T2D loci (TCF7L2, CDKAL1, IGF2BP2, KCNJ11,
HHEX/IDE, SLC30A8, CDKN2A/B region, WFS1, HNF1B and FTO), we
excluded from follow-up all SNPs that resided within the surrounding region,
with region boundaries defined by the furthest neighboring SNPs with P values
remaining ~0.01 (n = 1,981). For the PPARG region, we identified a SNP,
rs17036101, with a P value two orders of magnitude lower than the established
P12A susceptibility variant, rs1801282, and we took this signal forward to
replication. In total, we took 69 SNPs forward to stage 2 genotyping.

Stage 2 samples, genotyping and analysis. We genotyped the prioritized SNPs
in cases and controls from three UK replication sets (RS1, RS2 and RS3,
described in ref. 4; Supplementary Table 1 and Supplementary Methods).
Genotyping of prioritized SNPs in RS1, RS2 and RS3 was done by
KBiosciences. All assays were validated prior to use, using a standard 96-well
validation plate (KBiosciences) and up to 296 samples from the WTCCC study
(Supplementary Methods). Concordance rates between the Affymetrix and
KASPar/TaqMan genotypes (based on up to 296 replicate stage 1 samples) were
97.5% on average. All genotyped SNPs had genotype call frequency rates
>94% in the replication sets, and no SNPs had HWE P < 0.001 in cases or
controls. We tested for association with T2D using the Cochran-Armitage test
for trend. Results from the three replication sets were combined in a
Cochran-Mantel-Haenszel meta-analysis framework.
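
As an illustration of the Cochran-Armitage test for trend named above, the following minimal sketch computes the 1-d.f. statistic from genotype counts in cases and controls. The additive scores (0, 1, 2), the function name and the example counts are assumptions made for illustration; they are not taken from the study.

```python
import math


def cochran_armitage_trend(cases, controls, weights=(0, 1, 2)):
    """Return (chi2, p) for the 1-d.f. Cochran-Armitage trend test.

    cases, controls: genotype counts ordered by number of risk alleles.
    weights: additive scores 0, 1, 2 (an assumption of this sketch).
    """
    n = [r + s for r, s in zip(cases, controls)]   # column totals
    total, n_cases = sum(n), sum(cases)
    n_controls = total - n_cases
    sum_wn = sum(w * ni for w, ni in zip(weights, n))
    sum_w2n = sum(w * w * ni for w, ni in zip(weights, n))
    # Score statistic: observed minus expected weighted count in cases.
    u = sum(w * r for w, r in zip(weights, cases)) - n_cases * sum_wn / total
    var_u = n_cases * n_controls / total**3 * (total * sum_w2n - sum_wn**2)
    if var_u == 0:
        return 0.0, 1.0
    chi2 = u * u / var_u
    # Upper tail of a chi-square distribution with 1 d.f.
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p


# Hypothetical counts, ordered as (0, 1, 2) copies of the risk allele.
chi2, p = cochran_armitage_trend([20, 30, 50], [50, 30, 20])
```

Identical genotype distributions in cases and controls give a statistic of 0 and a P value of 1.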

For DGI, we genotyped the prioritized SNPs in three stage 2 case-control
samples (ref. 1; Supplementary Table 1 and Supplementary Methods). The
prioritized SNPs were genotyped in all DGI stage 1 and 2 samples using the iPLEX
Sequenom MassARRAY platform. We used 63 SNPs passing quality control
(>94% call rate, MAF > 0.01 and HWE P value > 0.001) for association
testing. We tested for T2D association in each DGI stage 2 case-control
sample under an additive genetic model. Results from the three DGI stage 2
samples were combined using Cochran-Mantel-Haenszel meta-analysis.

For FUSION, we genotyped the prioritized SNPs in a Finnish case-control 
sample (Supplementary Table 1 and Supplementary Methods) using the 
Sequenom Homogeneous Mass EXTEND or iPLEX Gold SBE assays, carried 
out at the National Human Genome Research Institute (NHGRI). In sum, 59 
SNPs had genotype call frequency >94% and HWE P value > 0.001. The 
genotype consistency rate among 56 duplicate samples was 100%, and the 
average call frequency of successfully genotyped SNPs was 97.3%. SNPs were 
analyzed using logistic regression, with adjustment for covariates including
birth province and an additive model for the genetic effect.
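
The quality-control thresholds quoted above (call frequency >94% and HWE P value > 0.001) can be sketched as a simple filter. The text does not state which HWE test was used; the 1-d.f. chi-square goodness-of-fit test below is an assumption, as are the function names and the example counts.

```python
import math


def hwe_chi2_p(n_aa, n_ab, n_bb):
    """P value of a 1-d.f. chi-square test for Hardy-Weinberg equilibrium.

    The chi-square test is an assumption of this sketch; an exact test
    may be preferred when genotype counts are small.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)          # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_qc(n_aa, n_ab, n_bb, n_attempted, min_call=0.94, min_hwe_p=0.001):
    """Apply the call-frequency and HWE thresholds quoted in the text."""
    call_rate = (n_aa + n_ab + n_bb) / n_attempted
    return call_rate > min_call and hwe_chi2_p(n_aa, n_ab, n_bb) > min_hwe_p


# Hypothetical SNP: 100 called genotypes out of 100 attempted.
ok = passes_qc(25, 50, 25, n_attempted=100)
```

A SNP with no heterozygotes at intermediate allele frequency, such as counts (50, 0, 50), fails the HWE threshold in this sketch.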



Comparison of genotypes from imputation and direct genotyping. We
genotyped a proportion of the prioritized imputed signals in the stage 1
samples of the three studies, and calculated respective concordance rates
(Supplementary Methods and Supplementary Table [...]). The results presented
in the main manuscript text are based on directly typed stage 1 data, except
rs7961581 in FUSION stage 1.

Combined meta-analysis for stages 1 and 2. We combined stage 1 and stage 2
data using both the OR-based and the weighted z score-based meta-analysis
approaches described above. We also assessed our results using random effects
meta-analysis to better account for any heterogeneity between the studies
(Supplementary Table 6). Locus-specific and combined sibling relative
risks were calculated using estimates of
effect size and risk-allele frequency derived from stage 2 replication samples
only, and under the assumption of allelic and locus independence,
as described in refs. 26 and 27.
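
The two combination approaches mentioned above can be sketched as follows. The details of both methods are given earlier in the Methods and are not reproduced here, so this is only a generic illustration: inverse-variance weighting of log odds ratios for the OR-based approach, and z scores weighted by the square root of sample size for the z score-based approach. The choice of weights, the function names and the example inputs are assumptions.

```python
import math


def fixed_effects_or(log_ors, ses):
    """Inverse-variance fixed-effects pooling of per-study log odds ratios.

    Returns (pooled OR, (lower, upper) 95% CI, two-sided P value).
    """
    weights = [1.0 / se**2 for se in ses]
    pooled = sum(w * b for w, b in zip(weights, log_ors)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    z = pooled / pooled_se
    ci = (math.exp(pooled - 1.96 * pooled_se),
          math.exp(pooled + 1.96 * pooled_se))
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return math.exp(pooled), ci, p


def weighted_z(z_scores, sample_sizes):
    """Combine signed z scores with weights sqrt(n) (assumed weighting).

    Returns (combined z, two-sided P value).
    """
    weights = [math.sqrt(n) for n in sample_sizes]
    z = (sum(w * zi for w, zi in zip(weights, z_scores))
         / math.sqrt(sum(w * w for w in weights)))
    return z, math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical inputs: two studies with the same OR and standard error.
pooled_or, ci, p_or = fixed_effects_or([math.log(1.2)] * 2, [0.1, 0.1])
z_comb, p_z = weighted_z([2.0, 2.0], [1000, 1000])
```

In both functions the direction of effect must be expressed relative to the same allele in every study before the results are combined.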

Stage 3 sample, genotyping and association analysis. We followed up 11 SNPs
(rs2641348, rs10490072, rs7578597, rs17036101, rs4607103, rs9472138,
rs864745, rs12779790, rs1153188, rs10923931 and rs7961581) in stage 3
samples from the deCODE, KORA, Danish, HUNT, NHS, GEM Consortium 
(CCC, EPIC, ADDITION/Ely, Norfolk) and METSIM studies (Supplementary 
Table 1 and Supplementary Methods). 

Combined meta-analysis for stages 1, 2 and 3. We combined stage 1, 2 and 3
data using both approaches (OR-based meta-analysis
and weighted P value-based z statistic combination across all sample sets)
described above. We also assessed our results using random effects
meta-analysis (Supplementary Table 6). We observed some evidence for
heterogeneity across studies (the I^2 statistic ranged from 0 to 5~.K% depending on the
SNP), with rs7578597 and rs10923931 showing the largest fold differences in
association P value between the fixed- and random-effects model analyses.
Differences in strength of association across studies (leading to evidence for 
heterogeneity) could reflect interesting biological associations that vary from 
study to study depending on subject ascertainment scheme. 
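
To make the heterogeneity measures above concrete, the sketch below computes Cochran's Q, the I^2 statistic and a random-effects estimate from per-study effects and standard errors. The DerSimonian-Laird estimator of the between-study variance is an assumption; the text does not name the random-effects method used. The function name and the example inputs are also hypothetical.

```python
import math


def heterogeneity_and_random_effects(effects, ses):
    """Cochran's Q, I^2 (%) and a DerSimonian-Laird random-effects estimate.

    effects: per-study log odds ratios; ses: their standard errors.
    """
    w = [1.0 / se**2 for se in ses]
    fixed = sum(wi * b for wi, b in zip(w, effects)) / sum(w)
    q = sum(wi * (b - fixed) ** 2 for wi, b in zip(w, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    # Method-of-moments estimate of the between-study variance tau^2.
    c = sum(w) - sum(wi * wi for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    w_re = [1.0 / (se**2 + tau2) for se in ses]
    random = sum(wi * b for wi, b in zip(w_re, effects)) / sum(w_re)
    return {
        "fixed": fixed,
        "q": q,
        "i2": i2,
        "tau2": tau2,
        "random": random,
        "random_se": math.sqrt(1.0 / sum(w_re)),
    }


# Hypothetical example: two studies with discordant effect estimates.
res = heterogeneity_and_random_effects([0.0, 1.0], [0.5, 0.5])
```

When tau^2 is estimated as zero the random-effects and fixed-effects estimates coincide; when it is positive the random-effects standard error is larger, which is why P values can differ markedly between the two models for SNPs with heterogeneous effects.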

Genomic control. An account of genomic control (GC) is provided in
Supplementary Methods. We adopted two strategies for the analysis of data
from this study. In the first, we performed GC-correction of data from DGI,
FUSION and WTCCC before stage 1 meta-analysis. We corrected each 
individual study for the GC inflation observed (directly genotyped and 
imputed data separately), and combined results across studies. We present 
the genome-wide distribution of association statistics in Supplementary Figure 
1. We note that, after study-specific genomic control adjustment, the estimated 
inflation factor for the stage 1 meta-analysis test statistic was 1.04. 
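
The inflation factor referred to above is conventionally estimated as the median of the observed 1-d.f. association statistics divided by the median of the chi-square distribution with 1 d.f. The following minimal sketch assumes that convention; the exact procedure used in the study is described in Supplementary Methods and is not reproduced here. The function names and example statistics are hypothetical.

```python
import statistics

# Median of the chi-square distribution with 1 d.f.
CHI2_1DF_MEDIAN = 0.4549364


def gc_lambda(chi2_stats):
    """Estimate the genomic control inflation factor from 1-d.f. statistics."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN


def gc_correct(chi2_stats):
    """Divide each statistic by lambda; lambda below 1 is treated as 1."""
    lam = max(1.0, gc_lambda(chi2_stats))
    return [x / lam for x in chi2_stats]


# Hypothetical statistics whose median is inflated by a factor of 1.3.
stats = [0.1, CHI2_1DF_MEDIAN * 1.3, 5.0]
lam = gc_lambda(stats)
corrected = gc_correct(stats)
```

In practice the median is taken over the genome-wide set of test statistics, computed separately for each study (and, as described above, separately for directly genotyped and imputed data) before the corrected statistics are combined.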

In the second strategy, we combined GC- uncorrected data from DGI, 
FUSION and WTCCC for stage 1 meta-analysis and did not correct 
the meta-analysis test statistics for the overall GC (to guard against over- 
conservativeness in the estimate of strength of association for interesting 
signals). We also present the genome-wide distribution of these statistics in 
Supplementary Figure 1.

For the combination of data across stages 1, 2 and 3, we also adopted these
two strategies (of using GC-corrected and GC-uncorrected stage 1 data). In the
first, we performed individual GC-correction of DGI, FUSION and WTCCC
stage 1 data before meta-analysis with stage 2 and stage 3 data (an approach
which may be over-conservative where, as was the case here, none of the
T2D-associated SNPs had particular hallmarks of stratification) (Supplementary
Note). In the second, we combined only uncorrected data (except for the
deCODE data, for which we applied GC correction, given a more marked
genomic control inflation (GC ~ 1.3) in that sample). We present the resulting
data from both approaches (GC-corrected and uncorrected stage 1
data) in Supplementary Table 6 and a comparison
in the Supplementary Note. All
data presented elsewhere in the manuscript reflect the GC-corrected analysis
strategy outcome.

Conditional analysis of T2D signals. For each SNP in Table 2, we assessed the
additive SNP association in the stage 1 and 2 samples before and after including
BMI as a covariate. In the region surrounding a specific T2D signal, we
assessed the additive SNP association before and after including the Table 2
SNP from the same region in the model. We analyzed the data and adjusted for
covariates as in the stage 1 and stage 2 analysis of each sample. Data were
combined across studies as described above. The ORs and CIs were calculated
using a fixed-effects method. For the
WTCCC stage 1 samples, we did not have BMI information available for
~1,500 of the population-based controls. We therefore carried out the
conditional BMI analyses by using all T2D cases and only those controls for
whom BMI data were available.